[Nsi-wg] Call Tomorrow and Agenda

Inder Monga imonga at es.net
Wed Feb 8 02:59:49 EST 2012


Hi all,

The following is dial-in information for Wednesday's NSI call, time:
7:00 PST, 10:00 EST, 15:00 GMT, 16:00 CET, 24:00 JST

1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
2. International participants dial: Toll Number: 303-248-0285, or
   International Toll-Free Number: http://www.readytalk.com/intl
3. Enter 7-digit access code 8937606, followed by "#"

Agenda:

1. Firewall issues: John Macauley
2. Error Handling: Henrik
3. Other topics

Thanks
Inder


Henrik's email attached:

-- 


Failure scenarios and recovery for the NSI protocol version 1.0 and 1.1

== Introduction ==

The main focus will be on control plane interaction: how to deal with
message loss and crashes, and how to recover from them.

With the exception of the forcedEnd primitives, all NSI control plane
message interactions happens like this:

Requester NSA               Provider NSA

operation               ->
<-  operation received ack
<-  operation result
operation result ack    ->

The main idea behind the separation of the operation and the operation
result is that they may be separated by a significant amount of time.
This is especially true for the provision operation, where days or months
can pass between the operation and the result.
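The four-message exchange above can be sketched as a small requester-side state machine. This is only an illustration of the message flow described in the text; the class and state names are assumptions, not NSI primitives:

```python
from enum import Enum, auto

class ExchangeState(Enum):
    """Requester-side view of one NSI operation exchange (illustrative)."""
    SENT = auto()       # operation dispatched, nothing heard back yet
    ACKED = auto()      # operation received ack has arrived
    COMPLETED = auto()  # operation result has arrived

class Exchange:
    """Tracks the two round-trips of a single operation.
    Hypothetical helper, not part of the NSI specification."""

    def __init__(self):
        self.state = ExchangeState.SENT

    def on_received_ack(self):
        if self.state is ExchangeState.SENT:
            self.state = ExchangeState.ACKED

    def on_result(self):
        # The result may arrive long after the ack (e.g. for provision).
        self.state = ExchangeState.COMPLETED
        return "operation result ack"   # sent back to the provider
```

Note that the gap between ACKED and COMPLETED is exactly the window in which most of the failure scenarios below occur.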

For failure scenarios, the loss of any of the four messages should be
considered, along with crashes of one or both of the NSAs at any point in
time. These failure scenarios can be generalized into the availability of
an NSA: it does not matter whether it is the network or the NSA that is
down; the distinction is whether the NSA received the message or not.

In general the problem is to ensure that the (intended) state of a 
connection
is kept in sync. There are two significant problems in the current 
protocol:

* No clear semantics for the operation received ack
* No clear division of responsibility between requester and provider

Both of these are semantic issues (i.e., behavior), and hence solving them
should not require any changes to the wire protocol.

 From a theoretical point of view, assuming an asynchronous network model
(note that "asynchronous" means something different in distributed systems
than in networking), the problem is impossible to solve. Taking a slightly
less pessimistic view (i.e., a partially synchronous network model), it
becomes possible to recover from some failures. Taking a pragmatic
approach, most errors are recoverable, given that the network and NSAs
become functional at some point in time.


== Control Plane Failure Scenarios & Recovery ==

The following will go through a range of failure scenarios and describe
how to recover from them. Note that some of the scenarios can be solved in
multiple ways. I've taken the approach that it is the responsibility of
the requester to ensure that the connection at the provider is in the
required state.

A: Requester NSA did not receive the operation received ack.

Note: This failure is equivalent to not being able to dispatch the message
       (there the failure just occurs earlier).

Note: If the operation result message is received within the timeout, 
this case
       can be ignored.

Potential causes: Message loss, network outage, provider NSA is down

If the requester NSA has not received the operation received ack after a
certain amount of time, it must assume that the connection cannot be
created or its state changed. This can be dealt with in multiple ways:

   1. Do nothing (hope the provider comes up again).
   2. Find an alternative circuit.
   3. Tear down the connection and send operation failure up the tree.

Which strategy to choose here is policy dependent and is up to the
individual implementation and organization. OpenNSA currently does 3.

For the sake of preventing stale connections, the requester can keep a
list of "dead" connections. The status of these connections can then be
checked at certain intervals via query, and a control primitive for fixing
the status sent if needed.
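The three recovery strategies and the "dead connections" list could look something like the following requester-side sketch. The strategy names and the query callback are assumptions for illustration; only the overall behavior comes from the text:

```python
class RequesterTimeoutHandler:
    """Illustrative requester-side handling when no 'operation received
    ack' arrives within the timeout (scenario A). Strategy names are
    assumptions, not NSI primitives."""

    def __init__(self, strategy="terminate"):
        self.strategy = strategy
        self.dead_connections = []  # candidates for later query-based checks

    def on_ack_timeout(self, connection_id):
        # Assume the provider never saw (or could not ack) the operation.
        self.dead_connections.append(connection_id)
        if self.strategy == "wait":       # 1. hope the provider comes back
            return None
        if self.strategy == "reroute":    # 2. try an alternative circuit
            return ("reserve_alternative", connection_id)
        # 3. tear down and send operation failure up the tree
        #    (what OpenNSA currently does, per the text)
        return ("terminate_and_fail_upwards", connection_id)

    def recheck_dead(self, query):
        """Periodically query the provider for connections we gave up on
        and return corrective primitives where the states disagree."""
        fixes = []
        for conn in self.dead_connections:
            actual = query(conn)          # provider is the authority
            if actual is not None:        # stale at provider: clean it up
                fixes.append(("terminate", conn))
        return fixes
```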


B: Provider NSA could not deliver the operation received ack

This situation is a special case of scenario A, but seen from the provider
point of view.

Repeated delivery attempts can be tried, but this is only an incremental
improvement/optimization and does not affect the end result.

The provider should not try to change the state of the connection beyond
the latest received primitive from the requester (do the least surprising
thing). It is up to the requester to discover the current state (via
query) and change it if needed.

Since it is the responsibility of the requester to discover the state,
there is no need for the provider to perform a "reverse query". In fact,
using reverse query for connection state updates may cause more harm than
good: having the provider change the connection status automatically may
not be what the requester wants (it might have compensated somehow), does
not follow the principle of least surprise, and leaves control of the
connection at two parties.

Alternatively, a "Hi, I'm alive; sorry for the downtime" primitive could
be introduced from provider to requester, which the requester can then use
to fire off any controlling primitives. This is, however, just an
optimization.
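Scenario B's repeated-delivery optimization, with the provider deliberately leaving connection state untouched, could be sketched as follows. The class and parameter names are assumptions for illustration:

```python
class ProviderAckDelivery:
    """Sketch of repeated delivery attempts for the 'operation received
    ack' (scenario B). This is only an optimization: on giving up, the
    provider leaves the connection in the last requested state and waits
    for the requester to discover the state via query."""

    def __init__(self, send, max_attempts=3):
        self.send = send                # callable returning True on success
        self.max_attempts = max_attempts

    def deliver(self, message):
        for _attempt in range(self.max_attempts):
            if self.send(message):
                return True
        # Give up without changing connection state
        # (do the least surprising thing).
        return False
```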


C: Provider NSA could not deliver the operation result message

This case should be handled as described in scenario B.


D: Requester NSA did not receive the operation result message.

This case should be handled as described in scenario A.


E: Operation result ack was not received.

This case should be handled as described in scenario B.


== Data Plane Failure Scenarios & Recovery ==

Data plane failures are somewhat different from control plane failures. 
I am
not well-versed in networking and NRMs, but will try to come up with a
strategy:

In general, I see two sorts of failures:

   1. The failure is happening in my local domain.
   2. The failure is happening outside my local domain.

This might be an overly simplistic view of things.

We assume that any fail-over, etc., has also failed, so the failure cannot
be corrected (if it can be corrected quickly, it probably should be).

The further handling of a data plane failure will probably be policy
dependent. For some users the network might be completely unusable after a
failure, while others would like to try to have it repaired. However,
trying to decide / figure out where and how this policy should be enforced
is a rather tricky process, and probably out of scope for NSI for now.

Instead I would suggest sending terminate messages downwards and forcedEnd
upwards. Once this propagates to the initial requester a policy-correct 
action
can be taken. I.e., convert a data-plane failure into a control-plane 
issue.
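The suggested propagation (terminate downwards, forcedEnd upwards) can be sketched as below. The connection attributes and callback names are assumptions; only the direction of the two primitives comes from the text:

```python
def handle_data_plane_failure(connection, send_terminate, send_forced_end):
    """Sketch of the suggested policy: on an uncorrectable data-plane
    failure, send terminate to child connections and forcedEnd towards the
    parent, so the failure eventually reaches the initial requester, which
    can take a policy-correct action. Attribute names are hypothetical."""
    for child in connection.children:      # downwards: terminate
        send_terminate(child)
    if connection.parent is not None:      # upwards: forcedEnd
        send_forced_end(connection.parent)
```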


== Recommendation / Action items ==

* Make the exact semantics of the operation received ack clear

Recommendation:
- The message has been received (duh)
- The request is sane
- The request has been serialized (crash safe).
- The specified connection exists (for provision, release, terminate)
- The request was authorized

   This has the following implications:
   - Once the operation received ack has been received by the requester,
     the connection should show up in a query result. If we cannot expect
     the connection to show up after receipt, the primitive should be
     removed, as it has no semantic value.
   - Failing early will save message exchanges and time.
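The recommended ack semantics amount to a set of provider-side checks that must all pass before the ack is sent. A sketch, where the request shape and helper names (`authorize`, `persist`) are hypothetical:

```python
def validate_before_ack(request, connections, authorize, persist):
    """Illustrative provider-side checks implied by the recommended ack
    semantics: the request is sane, the specified connection exists (for
    provision/release/terminate), the request is authorized, and it has
    been serialized (crash safe) before the ack goes out."""
    op = request.get("operation")
    if not op:                                           # request is sane
        return ("fail_early", "malformed request")
    if op in ("provision", "release", "terminate"):
        if request.get("connection_id") not in connections:
            return ("fail_early", "unknown connection")  # connection exists
    if not authorize(request):                           # request authorized
        return ("fail_early", "unauthorized")
    persist(request)                                     # serialized, crash safe
    # After this ack, the connection must show up in query results.
    return ("operation_received_ack", None)
```

Failing early here is what saves the extra message exchanges: a bad request is rejected in the first round-trip instead of surfacing later as an operation failure.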

* Make it clear which of the NSAs has the responsibility for what

Recommendation:
- The provider is the authority for connection status (duh)
- Keeping connection state synchronized is the responsibility of the 
requester

   This has the following implications:
   - Any (non-scheduled) connection state change must only be done at the
     initiative of the requester.
   - The requester query interface is not needed.

-- 

-- 
Inder Monga
510-486-6531
http://www.es.net
Follow us on Twitter: ESnetUpdates <http://bit.ly/bisCAd>
Visit our blog: ESnetUpdates Blog <http://bit.ly/9lSTO3>
