[Nsi-wg] Call Tomorrow and Agenda

Wed Feb 8 09:10:16 EST 2012

Slides for today.

On 2012-02-08, at 2:59 AM, Inder Monga wrote:

> 
> Hi all,
> 
> The following is dial-in information for Wednesday's NSI call, time:  7:00 PDT  10:00 EDT, 15:00 GMT,  16:00 CET,  24:00 JST
> 
> 1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
> 2. International participants dial: Toll Number: 303-248-0285, or International Toll-Free Number: http://www.readytalk.com/intl
> 3. Enter 7-digit access code 8937606, followed by “#”
> 
> Agenda:
> 
> 1. Firewall issues: John Macauley
> 2. Error Handling: Henrik
> 3. Other topics
> 
> Thanks
> Inder
> 
> 
> Henrik's email attached:
> 
> -- 
> 
> 
> Failure scenarios and recovery for the NSI protocol version 1.0 and 1.1 
> 
> == Introduction == 
> 
> The main focus will be on the control plane interaction, and how to deal with 
> message loss, crashes, and how to recover from them. 
> 
> With the exception of the forcedEnd primitives, all NSI control plane 
> message interactions happen like this: 
> 
> Requester NSA               Provider NSA 
> 
> operation               -> 
>                         <-  operation received ack 
>                         <-  operation result 
> operation result ack    -> 
> 
> The main idea behind the separation of the operation and operation result is 
> that they may be separated by a significant time, especially for the provision 
> operation, where several days or months can pass between the operation and 
> the result. 
> 
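As a rough illustration of the exchange above, the following Python sketch tracks which of the four messages have been observed for a single operation. The names `Message` and `Exchange` are invented for this example and are not part of the NSI protocol.

```python
from enum import Enum, auto


class Message(Enum):
    """The four control plane messages of one operation, in protocol order."""
    OPERATION = auto()        # requester -> provider
    RECEIVED_ACK = auto()     # provider  -> requester
    RESULT = auto()           # provider  -> requester
    RESULT_ACK = auto()       # requester -> provider


class Exchange:
    """Records which messages of a single operation have been observed."""

    def __init__(self):
        self.seen = set()

    def record(self, message):
        self.seen.add(message)

    def missing(self):
        # Messages not yet observed, listed in protocol order.
        return [m for m in Message if m not in self.seen]

    def complete(self):
        return not self.missing()
```

Each failure scenario discussed later corresponds to one of the entries returned by `missing()` never arriving.
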
> For failure scenarios, the loss of any of the four messages should be 
> considered along with crashes of one or both of the NSAs at any point in time. 
> These failure scenarios can be generalized into the availability of an NSA, 
> i.e., it does not matter whether it is the network or the NSA that is down; the 
> distinction is whether the NSA received the message or not. 
> 
> In general the problem is to ensure that the (intended) state of a connection 
> is kept in sync. There are two significant problems in the current protocol: 
> 
> * No clear semantics for the operation received ack 
> * No clear division of responsibility between requester and provider 
> 
> Both of these are semantic issues (i.e., behavior), and hence solving them 
> should not require any changes to the wire protocol. 
> 
> From a theoretical point of view and assuming an asynchronous network model 
> (note that async means something else in distributed systems than in networks) 
> the problem is impossible to solve. Taking a slightly less pessimistic view 
> (i.e., a partially synchronous network model), it becomes possible to recover 
> from some failures. Taking a pragmatic approach, most errors are recoverable, given 
> that the network and NSAs become functional at some point in time. 
> 
> 
> == Control Plane Failure Scenarios & Recovery == 
> 
> The following will go through a range of failure scenarios, and describe how to 
> recover from them. Note that some of the scenarios can be solved in multiple 
> ways. I've taken the approach that it is the responsibility of the requester to 
> ensure that the connection at the provider is in the required state. 
> 
> A: Requester NSA did not receive the operation received ack. 
> 
> Note: This failure is equivalent to not being able to dispatch the message 
>       (here the failure just occurs earlier). 
> 
> Note: If the operation result message is received within the timeout, this case 
>       can be ignored. 
> 
> Potential causes: Message loss, network outage, provider NSA is down 
> 
> If the requester NSA has not received the operation received ack after a 
> certain amount of time, it must assume that the connection cannot be created or 
> its state changed. This can be dealt with in multiple ways: 
> 
>   1. Do nothing (hope it comes up again) 
>   2. An alternative circuit can be found. 
>   3. Tear down the connection and send operation failure up the tree. 
> 
> Which strategy to choose here is policy dependent and is up to the individual 
> implementation and organization. OpenNSA currently does 3. 
> 
> For the sake of preventing stale connections, the requester can keep a list of 
> "dead" connections. The status of these connections can then be checked at 
> certain intervals via query, and a control primitive for fixing the status sent 
> if needed. 
> 
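A minimal sketch of the requester-side handling for scenario A, assuming a simple timeout and the "dead" connection list described above. The class name, the `query` and `fix` callables, and any state names passed in are hypothetical; the protocol does not define them.

```python
import time


class Requester:
    """Hypothetical requester-side bookkeeping for scenario A."""

    def __init__(self, ack_timeout, clock=time.monotonic):
        self.ack_timeout = ack_timeout   # seconds to wait for the ack
        self.clock = clock               # injectable so tests can control time
        self.pending = {}                # connection id -> time operation was sent
        self.dead = set()                # connections whose state is unknown

    def sent(self, conn_id):
        self.pending[conn_id] = self.clock()

    def ack_received(self, conn_id):
        self.pending.pop(conn_id, None)

    def expire(self):
        """Move operations that timed out without an ack to the dead list."""
        now = self.clock()
        for conn_id, sent_at in list(self.pending.items()):
            if now - sent_at >= self.ack_timeout:
                del self.pending[conn_id]
                self.dead.add(conn_id)

    def reconcile(self, query, fix, wanted_state):
        """Check dead connections via query; send a fixing primitive if needed.

        query(conn_id) returns the provider's state, or None if unreachable.
        fix(conn_id, state) sends whatever primitive restores wanted_state.
        """
        for conn_id in list(self.dead):
            state = query(conn_id)
            if state is None:
                continue                 # provider still unreachable; try later
            if state != wanted_state:
                fix(conn_id, wanted_state)
            self.dead.discard(conn_id)
```

Calling `expire()` and `reconcile()` from a periodic timer gives the "checked at certain intervals" behaviour; which of strategies 1 to 3 is applied when an entry expires remains a policy decision outside this sketch.
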
> 
> B: Provider NSA could not deliver the operation received ack 
> 
> This situation is a special case of scenario A, but seen from the provider 
> point of view. 
> 
> Repeated delivery attempts can be tried, but this is only an incremental 
> improvement/optimization and does not affect the end result. 
> 
> The provider should not try to change the state of the connection beyond what 
> the latest received primitive from the requester asks for (do the least surprising thing). 
> It is up to the requester to discover the current state (via query) and change 
> it if needed. 
> 
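The provider behaviour in scenario B might be sketched as follows: bounded delivery retries, no state change when delivery fails, and a query method the requester can use to discover the state. `deliver_with_retries`, `ProviderConnection`, and the `send` callable are assumptions made for illustration only.

```python
def deliver_with_retries(send, message, attempts=3):
    """Try to deliver a message a bounded number of times.

    send(message) is a hypothetical transport call returning True on
    success and False on failure. Returns True if any attempt succeeded.
    """
    for _ in range(attempts):
        if send(message):
            return True
    return False


class ProviderConnection:
    """Connection state as held by the provider."""

    def __init__(self, state):
        self.state = state

    def apply(self, primitive_state, send_ack):
        # Apply the latest primitive received from the requester.
        self.state = primitive_state
        # A delivery failure does not trigger any further state change;
        # the requester is expected to discover the state via query.
        return deliver_with_retries(send_ack, "ack")

    def query(self):
        return self.state
```

Retrying only improves the odds of delivery; if every attempt fails, the connection simply stays in the state the requester last asked for.
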
> Since it is the responsibility of the requester to discover the state, there is 
> no need for the provider to perform a "reverse query". In fact, using the reverse 
> query for connection state updates may cause more harm than good: having the 
> provider change the connection status automatically may not be what the 
> requester wants (it might have compensated somehow), does not follow 
> the principle of least surprise, and leaves control of the connection with two 
> parties. 
> 
> Alternatively, a "Hi, I'm alive; sorry for the downtime" primitive could be 
> introduced from provider to requester, which the requester can then use to fire 
> off any controlling primitives. This is, however, just an optimization. 
> 
> 
> C: Provider NSA could not deliver the operation result message 
> 
> This case should be handled as described in scenario B. 
> 
> 
> D: Requester NSA did not receive the operation result message. 
> 
> This case should be handled as described in scenario A. 
> 
> 
> E: Operation result ack was not received. 
> 
> This case should be handled as described in scenario B. 
> 
> 
> == Data Plane Failure Scenarios & Recovery == 
> 
> Data plane failures are somewhat different from control plane failures. I am 
> not well-versed in networking and NRMs, but will try to come up with a 
> strategy: 
> 
> In general, I see two sorts of failures: 
> 
>   1. The failure is happening in my local domain. 
>   2. The failure is happening outside my local domain. 
> 
> This might be an overly simplistic view of things. 
> 
> We assume that any fail-over, etc. has also failed, so the failure cannot be 
> corrected (if it can be corrected quickly, it probably should be). 
> 
> The further handling of a data plane failure will probably be policy dependent. 
> For some users, the network might be completely unusable after a failure, 
> whereas others would like to try to have it repaired. However, trying to decide / 
> figure out where and how this policy should be enforced is a rather tricky 
> process, and probably out of scope of NSI for now. 
> 
> Instead I would suggest sending terminate messages downwards and forcedEnd 
> upwards. Once this propagates to the initial requester a policy-correct action 
> can be taken. I.e., convert a data-plane failure into a control-plane issue. 
> 
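The suggested propagation could look roughly like this in a tree of NSAs: terminate goes to everything below the failing NSA, and forcedEnd travels up to the initial requester. The `NSA` class and the log tuples are hypothetical, and how an intermediate NSA treats its other children when a forcedEnd passes through is deliberately left out, since the text does not specify it.

```python
class NSA:
    """Hypothetical node in a request tree; the parent is the requester side."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def data_plane_failure(self, log):
        """Turn a local data plane failure into control plane messages."""
        self._terminate_children(log)
        self._forced_end_upwards(log)

    def _terminate_children(self, log):
        # terminate is sent downwards, recursively, to every provider below.
        for child in self.children:
            log.append(("terminate", self.name, child.name))
            child._terminate_children(log)

    def _forced_end_upwards(self, log):
        # forcedEnd is relayed upwards until the initial requester is reached.
        node = self
        while node.parent is not None:
            log.append(("forcedEnd", node.name, node.parent.name))
            node = node.parent
```

Once the last forcedEnd entry reaches the root of the tree, the initial requester can apply whatever policy it has for the failure.
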
> 
> == Recommendation / Action items == 
> 
> * Make the exact semantics of the operation received ack clear 
> 
> Recommendation: 
> - The message has been received (duh) 
> - The request is sane 
> - The request has been serialized (crash safe). 
> - The specified connection exists (for provision, release, terminate) 
> - The request was authorized 
> 
>   This has the following implication: 
>   - Once the operation received ack has been received by the requester, 
>     the connection should show up on a query result. If we cannot expect 
>     the connection to show up after receipt, the primitive should be 
>     removed as it has no semantic value. 
>   - Failing early will save message exchanges and time. 
> 
> * Make it clear which of the NSAs has the responsibility for what 
> 
> Recommendation: 
> - The provider is the authority for connection status (duh) 
> - Keeping connection state synchronized is the responsibility of the requester 
> 
>   This has the following implication: 
>   - Any (non-scheduled) connection state change must only be done at the 
>     initiative of the requester 
>   - The requester query interface is not needed. 
> 
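The recommended semantics for the operation received ack could be sketched as a series of checks the provider runs before sending it. Everything here is an assumption made for illustration: the request is modelled as a dict with invented keys, the journal is a plain list standing in for crash-safe storage, and "reserve" is assumed as the name of the operation that creates a connection.

```python
class Rejected(Exception):
    """Raised instead of sending the operation received ack."""


def accept_request(request, connections, journal, authorized):
    """Run the recommended checks; returning normally means the ack may be sent.

    request      -- dict with hypothetical keys "operation" and "connection_id"
    connections  -- set of connection ids known to the provider
    journal      -- list standing in for crash-safe storage
    authorized   -- callable deciding whether the request is permitted
    """
    operation = request.get("operation")
    conn_id = request.get("connection_id")

    # The request is sane.
    known = {"reserve", "provision", "release", "terminate"}
    if operation not in known or not conn_id:
        raise Rejected("malformed request")

    # The specified connection exists (for provision, release, terminate).
    if operation != "reserve" and conn_id not in connections:
        raise Rejected("unknown connection")

    # The request was authorized.
    if not authorized(request):
        raise Rejected("not authorized")

    # The request has been serialized; a real implementation would write
    # to durable storage before acknowledging.
    journal.append(dict(request))
    if operation == "reserve":
        connections.add(conn_id)   # so the connection shows up in a query
```

Because every check runs before the ack, a requester that has received the ack can expect the connection to appear in a query result, and a request that fails any check is rejected early.
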
> --  
> 
> -- 
> Inder Monga
> 510-486-6531
> http://www.es.net
> Follow us on Twitter: ESnetUpdates/Twitter 
> Visit our blog: ESnetUpdates Blog
> 
> _______________________________________________
> nsi-wg mailing list
> nsi-wg at ogf.org
> https://www.ogf.org/mailman/listinfo/nsi-wg

-------------- next part --------------
A non-text attachment was scrubbed...
Name: NSI-firewall issues-v1.pdf
Type: application/pdf
Size: 1414400 bytes
Desc: not available
URL: <http://www.ogf.org/pipermail/nsi-wg/attachments/20120208/329dc2cf/attachment-0001.pdf>