[Nsi-wg] NSI failure handling

Henrik Thostrup Jensen htj at nordu.net
Fri Jan 20 04:35:33 EST 2012


Hi everyone

I've finished the first draft describing failure scenarios. I've tried to 
keep it fairly high-level instead of obsessing over every possible corner 
case.

The main contribution is a recommendation for the semantics of the initial 
reply, and a clear division of responsibility between provider and requester 
with respect to changing connection state. It is likely that not everyone 
will agree with these, but at least we can get a discussion started and try 
to formalize things a bit (which is necessary if we want predictable failure 
handling).

--


Failure scenarios and recovery for the NSI protocol version 1.0 and 1.1

== Introduction ==

The main focus will be on the control plane interaction: how to deal with
message loss and crashes, and how to recover from them.

With the exception of the forcedEnd primitive, all NSI control plane
message interactions happen like this:

Requester NSA               Provider NSA

operation               ->
                         <-  operation received ack
                         <-  operation result
operation result ack    ->

The main idea behind separating the operation from the operation result is
that they may be separated by a significant amount of time; for the provision
operation in particular, days or months can pass between the operation and
the result.
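
To make the exchange concrete, here is a minimal sketch of the requester side
(Python; the transport and the message shapes are invented for illustration,
not taken from OpenNSA or any actual NSI API):

    import queue

    ACK_TIMEOUT = 30        # seconds to wait for the operation received ack
    RESULT_TIMEOUT = None   # block indefinitely; the result may be months away

    class RequesterExchange:
        def __init__(self, transport):
            self.transport = transport    # hypothetical message transport
            self.acks = queue.Queue()     # filled by the message receiver
            self.results = queue.Queue()  # filled by the message receiver

        def perform(self, operation):
            self.transport.send(operation)                     # operation ->
            try:
                self.acks.get(timeout=ACK_TIMEOUT)             # <- received ack
            except queue.Empty:
                raise TimeoutError("no ack; see failure scenario A")
            result = self.results.get(timeout=RESULT_TIMEOUT)  # <- result
            self.transport.send(("result-ack", result))        # result ack ->
            return result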

For failure scenarios, the loss of any of the four messages should be
considered, along with crashes of one or both of the NSAs at any point in
time. These failure scenarios can be generalized into the availability of an
NSA, i.e., it does not matter whether it is the network or the NSA that is
down; the distinction is whether the NSA received the message or not.

In general the problem is to ensure that the (intended) state of a connection
is kept in sync. There are two significant problems in the current protocol:

* No clear semantics for the operation received ack
* No clear division of responsibility between requester and provider

Both of these are semantic issues (i.e., behavior), and hence solving them
should not require any changes to the wire protocol.

From a theoretical point of view, and assuming an asynchronous network model
(note that async means something else in distributed systems than in networks),
the problem is impossible to solve. Taking a slightly less pessimistic view
(i.e., a partially synchronous network model), it becomes possible to recover
from some failures. Taking a pragmatic approach, most errors are recoverable,
given that the network and NSAs become functional at some point in time.


== Control Plane Failure Scenarios & Recovery ==

The following will go through a range of failure scenarios and describe how to
recover from them. Note that some of the scenarios can be solved in multiple
ways. I've taken the approach that it is the responsibility of the requester
to ensure that the connection at the provider is in the required state.

A: Requester NSA did not receive the operation received ack.

Note: This failure is equivalent to not being able to dispatch the message
       (here the failure just occurs earlier).

Note: If the operation result message is received within the timeout, this case
       can be ignored.

Potential causes: Message loss, network outage, provider NSA is down

If the requester NSA has not received the operation received ack after a
certain amount of time, it must assume that the connection cannot be created
or its state changed. This can be dealt with in multiple ways:

   1. Do nothing (hope it comes up again).
   2. Find an alternative circuit.
   3. Tear down the connection and send operation failure up the tree
      (sketched below).

Which strategy to choose here is policy dependent and is up to the individual
implementation and organization. OpenNSA currently does 3.

For the sake of preventing stale connections, the requester can keep a list of
"dead" connections. The status of these connections can then be checked at
certain intervals via query, and a control primitive for fixing the status
sent if needed.
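
A rough sketch of strategy 3 combined with the dead-connection list (Python;
send_terminate, notify_failure and query are stand-ins for whatever the
implementation actually uses, not any real OpenNSA API):

    class TransportError(Exception):
        pass

    dead_connections = set()

    def handle_ack_timeout(connection_id, send_terminate, notify_failure):
        # Strategy 3: tear down and send operation failure up the tree.
        try:
            send_terminate(connection_id)
        except TransportError:
            # Provider unreachable; remember the connection for later cleanup.
            dead_connections.add(connection_id)
        notify_failure(connection_id)

    def requery_dead_connections(query, send_terminate):
        # Run at intervals: check the status via query and send a corrective
        # primitive if the connection still exists in a non-terminal state.
        for connection_id in list(dead_connections):
            try:
                state = query(connection_id)
            except TransportError:
                continue                  # provider still down; retry later
            if state not in (None, "Terminated"):
                send_terminate(connection_id)
            dead_connections.discard(connection_id)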


B: Provider NSA could not deliver the operation received ack

This situation is a special case of scenario A, but seen from the provider's
point of view.

Repeated delivery attempts can be tried, but this is only an incremental
improvement/optimization and does not affect the end result.
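
For illustration, such a retry loop could look like this (Python; deliver()
is a placeholder for the actual transport call):

    import time

    def deliver_with_backoff(deliver, message, attempts=5, base_delay=2.0):
        # Retrying improves the odds of delivery, but recovery must still
        # work when all attempts fail, so this is purely an optimization.
        for attempt in range(attempts):
            try:
                deliver(message)
                return True
            except IOError:
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        return False  # give up; the requester will discover state via query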

The provider should not try to change the state of the connection beyond what
the latest received primitive from the requester dictates (do the least
surprising thing). It is up to the requester to discover the current state
(via query) and change it if needed.

Since it is the responsibility of the requester to discover the state, there
is no need for the provider to perform a "reverse query". In fact, using
reverse query for connection state updates may cause more harm than good:
having the provider change the connection status automatically may not be
what the requester wants (it might have compensated somehow), does not follow
the principle of least surprise, and leaves control of the connection with
two parties.

Alternatively, a "Hi, I'm alive; sorry for the downtime" primitive be
introduced from provider to requester, which the requester can then use to fire
off any controlling primitives. This is, however, just an optimization.
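
Either way, the requester-driven reconciliation could be as simple as this
sketch (Python; query and the primitive senders are hypothetical):

    def reconcile(connection_id, intended_state, query, primitives):
        # primitives maps a target state to the primitive that reaches it,
        # e.g. {"Provisioned": send_provision, "Terminated": send_terminate}.
        actual_state = query(connection_id)
        if actual_state != intended_state:
            primitives[intended_state](connection_id)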


C: Provider NSA could not deliver the operation result message

This case should be handled as described in scenario B.


D: Requester NSA did not receive the operation result message.

This case should be handled as described in scenario A.


E: Operation result ack was not received.

This case should be handled as described in scenario B.


== Data Plane Failure Scenarios & Recovery ==

Data plane failures are somewhat different from control plane failures. I am
not well-versed in networking and NRMs, but will try to come up with a
strategy:

In general, I see two sorts of failures:

   1. The failure is happening in my local domain.
   2. The failure is happening outside my local domain.

This might be an overly simplistic view of things.

We assume that any fail-over, etc., has also failed, so the failure cannot be
corrected (if it can be corrected quickly, it probably should be).

The further handling of a data plane failure will probably be policy
dependent. For some users the network might be completely unusable after a
failure, whereas others would like to try and have it repaired. However,
trying to decide where and how this policy should be enforced is a rather
tricky process, and probably out of scope for NSI for now.

Instead I would suggest sending terminate messages downwards and forcedEnd
upwards. Once this propagates to the initial requester, a policy-correct
action can be taken. I.e., convert a data-plane failure into a control-plane
issue.
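
As a sketch of that conversion (Python; the connection structure and the
message senders are made up for illustration):

    def on_data_plane_failure(connection, send_terminate, send_forced_end):
        # Terminate downwards: release resources in the sub-connections.
        for child in connection.children:
            send_terminate(child)
        # forcedEnd upwards: the initial requester decides what happens next,
        # turning the data plane failure into a control plane decision.
        send_forced_end(connection.parent, connection.connection_id)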


== Recommendation / Action items ==

* Make the exact semantics of the operation received ack clear

Recommendation:
- The message has been received (duh)
- The request is sane
- The request has been serialized (crash safe).
- The specified connection exists (for provision, release, terminate)
- The request was authorized

   This has the following implications (see the sketch below):
   - Once the operation received ack has been received by the requester,
     the connection should show up in a query result. If we cannot expect
     the connection to show up after receipt of the ack, the primitive
     should be removed, as it has no semantic value.
   - Failing early will save message exchanges and time.
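
To illustrate, a provider-side sketch of the above ack semantics (Python;
validate, authorize, lookup, persist and send_ack are placeholder callables):

    def handle_operation(request, validate, authorize, lookup, persist,
                         send_ack):
        validate(request)    # the request is sane
        authorize(request)   # the request was authorized
        if request["operation"] in ("provision", "release", "terminate"):
            if lookup(request["connection_id"]) is None:
                raise KeyError("unknown connection: %s"
                               % request["connection_id"])
        persist(request)     # serialized, crash safe
        send_ack(request)    # only now is the operation received ack sent;
                             # from here the connection shows up in query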

* Make it clear which of the NSAs has the responsibility for what

Recommendation:
- The provider is the authority for connection status (duh)
- Keeping connection state synchronized is the responsibility of the requester

   This has the following implications:
   - Any (non-scheduled) connection state change must only be done at the
     initiative of the requester
   - The requester query interface is not needed.

--


     Best regards, Henrik

  Henrik Thostrup Jensen <htj at nordu.net>
  NORDUnet

