[Nsi-wg] Call Tomorrow and Agenda

Thu Feb 9 05:20:13 EST 2012

John

Can you please send the minutes of the discussion and the next actions?

For the list members:

There was a proposal to discuss the NSI-Client (or UNI), or whatever we 
want to name the interface between an application and an NSA, during 
the March OGF face-to-face meeting. A 2-hour timeslot should be enough. 
The question is: does every software application implementing the NSA 
interface have to follow all the rules of an NSA, including the state 
machine of the RA?

For next week's discussion - please read Henrik's email on error 
handling below.

Thanks
Inder

John MacAuley wrote:
> Slides for today.
>
>
> On 2012-02-08, at 2:59 AM, Inder Monga wrote:
>
>>
>> Hi all,
>>
>> The following is dial-in information for Wednesday's NSI call, time: 
>> 7:00 PST, 10:00 EST, 15:00 GMT, 16:00 CET, 24:00 JST
>>
>> 1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
>> 2. International participants dial: Toll Number: 303-248-0285
>>    Or International Toll-Free Number: http://www.readytalk.com/intl
>> 3. Enter 7-digit access code 8937606, followed by "#"
>>
>> Agenda:
>>
>> 1. Firewall issues: John MacAuley
>> 2. Error Handling: Henrik
>> 3. Other topics
>>
>> Thanks
>> Inder
>>
>>
>> Henrik's email attached:
>>
>> -- 
>>
>>
>> Failure scenarios and recovery for the NSI protocol version 1.0 and 1.1
>>
>> == Introduction ==
>>
>> The main focus will be on the control plane interaction, and how to
>> deal with message loss and crashes, and how to recover from them.
>>
>> With the exception of the forcedEnd primitives, all NSI control plane
>> message interactions happen like this:
>>
>> Requester NSA               Provider NSA
>>
>> operation               ->
>> <-  operation received ack
>> <-  operation result
>> operation result ack    ->
>>
>> The main idea behind the separation of the operation and the operation
>> result is that they may be separated by a significant amount of time.
>> This applies especially to the provision operation, where several days
>> or months can pass between the operation and the result.
>>
>> For failure scenarios, the loss of any of the four messages should be
>> considered, along with crashes of one or both of the NSAs at any point
>> in time. These failure scenarios can be generalized into the
>> availability of an NSA, i.e., it does not matter whether it is the
>> network or the NSA that is down; what matters is whether the NSA
>> received the message or not.
>>
>> In general the problem is to ensure that the (intended) state of a
>> connection is kept in sync. There are two significant problems in the
>> current protocol:
>>
>> * No clear semantics for the operation received ack
>> * No clear division of responsibility between requester and provider
>>
>> Both of these are semantic issues (i.e., behavior), and hence solving
>> them should not require any changes to the wire protocol.
>>
>> From a theoretical point of view, and assuming an asynchronous network
>> model (note that async means something else in distributed systems than
>> in networks), the problem is impossible to solve. Taking a slightly
>> less pessimistic view (i.e., a partially synchronous network model), it
>> becomes possible to recover from some failures. Taking a pragmatic
>> approach, most errors are recoverable, given that the network and NSAs
>> become functional at some point in time.
>>
>>
>> == Control Plane Failure Scenarios & Recovery ==
>>
>> The following will go through a range of failure scenarios and describe
>> how to recover from them. Note that some of the scenarios can be solved
>> in multiple ways. I've taken the approach that it is the responsibility
>> of the requester to ensure that the connection at the provider is in
>> the required state.
>>
>> A: Requester NSA did not receive the operation received ack.
>>
>> Note: This failure is equivalent to not being able to dispatch the
>>       message (here the failure just occurs earlier).
>>
>> Note: If the operation result message is received within the timeout,
>>       this case can be ignored.
>>
>> Potential causes: Message loss, network outage, provider NSA is down
>>
>> If the requester NSA has not received the operation received ack after
>> a certain amount of time, it must assume that the connection could not
>> be created or its state could not be changed. This can be dealt with in
>> multiple ways:
>>
>>   1. Do nothing (hope it comes up again).
>>   2. Find an alternative circuit.
>>   3. Tear down the connection and send operation failure up the tree.
>>
>> Which strategy to choose here is policy dependent and is up to the
>> individual implementation and organization. OpenNSA currently does 3.
>>
>> For the sake of preventing stale connections, the requester can keep a
>> list of "dead" connections. The status of these connections can then be
>> checked at certain intervals via query, and a control primitive for
>> fixing the status sent if needed.
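
A minimal sketch of the requester-side handling described in scenario A, in Python. The names `Policy`, `Requester`, `on_ack_timeout`, `recheck`, and the `query` and `send` callbacks are hypothetical and not part of NSI or OpenNSA. The default policy is option 3 only because the text says that is what OpenNSA currently does.

```python
from enum import Enum


class Policy(Enum):
    WAIT = 1       # option 1: do nothing and hope the provider comes back
    REROUTE = 2    # option 2: find an alternative circuit
    TEAR_DOWN = 3  # option 3: tear down and report failure up the tree


class Requester:
    def __init__(self, policy=Policy.TEAR_DOWN):
        self.policy = policy
        self.dead = {}  # connection id -> state the requester intends

    def on_ack_timeout(self, conn_id, intended_state):
        """No operation received ack arrived in time: remember the
        connection as "dead" and return the policy to apply."""
        self.dead[conn_id] = intended_state
        return self.policy

    def recheck(self, query, send):
        """Run at intervals. query(conn_id) returns the provider's view of
        the state, or None if the provider is still unreachable.
        send(conn_id, state) issues a control primitive to fix the state."""
        for conn_id, intended in list(self.dead.items()):
            actual = query(conn_id)
            if actual is None:
                continue                    # still down, try again later
            if actual == intended:
                del self.dead[conn_id]      # back in sync, stop tracking
            else:
                send(conn_id, intended)     # fix it, confirm on next check
```

A connection is only dropped from the list once a query confirms the intended state, so a lost fix-up primitive is simply retried on the next interval.
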
>>
>>
>> B: Provider NSA could not deliver the operation received ack
>>
>> This situation is a special case of scenario A, but seen from the
>> provider's point of view.
>>
>> Repeated delivery attempts can be tried, but this is only an incremental
>> improvement/optimization and does not affect the end result.
>>
>> The provider should not try to change the state of the connection
>> beyond what the latest received primitive from the requester asks for
>> (do the least surprising thing). It is up to the requester to discover
>> the current state (via query) and change it if needed.
>>
>> Since it is the responsibility of the requester to discover the state,
>> there is no need for the provider to perform a "reverse query". In
>> fact, using the reverse query for connection state updates may cause
>> more harm than good: having the provider change the connection status
>> automatically may not be what the requester wants (it might have
>> compensated somehow), does not follow the principle of least surprise,
>> and leaves control of the connection with two parties.
>>
>> Alternatively, a "Hi, I'm alive; sorry for the downtime" primitive could
>> be introduced from provider to requester, which the requester can then
>> use to fire off any controlling primitives. This is, however, just an
>> optimization.
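
The provider behavior in scenario B could be sketched roughly as follows. This is an illustration under assumptions: `deliver_with_retries`, `ProviderConnection`, and the state names are invented for the example, and a failed delivery is modeled as a Python `ConnectionError`.

```python
def deliver_with_retries(send, message, attempts=3):
    """Try to deliver a message a bounded number of times.

    Retrying is only an optimization; the caller must behave the same
    whether this returns True or False."""
    for _ in range(attempts):
        try:
            send(message)
            return True
        except ConnectionError:
            pass  # requester or network unavailable, try again
    return False


class ProviderConnection:
    def __init__(self, state="reserved"):
        self.state = state

    def apply(self, new_state, send_ack):
        """Apply the latest primitive received from the requester, then
        try to acknowledge it."""
        self.state = new_state
        delivered = deliver_with_retries(send_ack, "operation received")
        # Least surprise: an undeliverable ack does not roll the state
        # back. The requester is expected to discover it via query.
        return delivered
```
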
>>
>>
>> C: Provider NSA could not deliver the operation result message
>>
>> This case should be handled as described in scenario B.
>>
>>
>> D: Requester NSA did not receive the operation result message.
>>
>> This case should be handled as described in scenario A.
>>
>>
>> E: Operation result ack was not received.
>>
>> This case should be handled as described in scenario B.
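
Scenarios A through E reduce to two handling strategies, which can be written down as a small lookup. The table and the `handling` helper below are only a restatement of the text above in Python; the names are made up for the example.

```python
# Which scenario's handling applies to each failure scenario.
HANDLE_AS = {"A": "A", "B": "B", "C": "B", "D": "A", "E": "B"}

# Scenario A describes what the requester does; scenario B describes the
# same kind of failure from the provider's point of view.
ACTING_SIDE = {"A": "requester", "B": "provider"}


def handling(scenario):
    """Return (scenario whose handling applies, side that acts)."""
    base = HANDLE_AS[scenario]
    return base, ACTING_SIDE[base]
```
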
>>
>>
>> == Data Plane Failure Scenarios & Recovery ==
>>
>> Data plane failures are somewhat different from control plane failures.
>> I am not well-versed in networking and NRMs, but will try to come up
>> with a strategy:
>>
>> In general, I see two sorts of failures:
>>
>>   1. The failure is happening in my local domain.
>>   2. The failure is happening outside my local domain.
>>
>> This might be an overly simplistic view of things.
>>
>> We assume that any fail-over, etc. has also failed, so the failure
>> cannot be corrected (if it can be corrected quickly, it probably should
>> be).
>>
>> The further handling of a data plane failure will probably be policy
>> dependent. For some users the network might be completely unusable
>> after a failure, whereas others would like to try to have it repaired.
>> However, trying to decide or figure out where and how this policy
>> should be enforced is a rather tricky process, and probably out of
>> scope of NSI for now.
>>
>> Instead I would suggest sending terminate messages downwards and
>> forcedEnd upwards. Once this propagates to the initial requester, a
>> policy-correct action can be taken. I.e., convert a data-plane failure
>> into a control-plane issue.
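
One way to picture the suggested propagation is a tree of NSAs, with the initial requester at the root. In the sketch below the `Nsa` class, the helper names, and the message log are all hypothetical. It assumes the NSA that detects the failure terminates its own subtree and that forcedEnd is relayed hop by hop up to the root; whether intermediate NSAs also terminate their other children is left open here, as it is in the text.

```python
class Nsa:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # requester side (up the tree)
        self.children = []        # provider side (down the tree)
        if parent is not None:
            parent.children.append(self)


def terminate_down(nsa, log):
    """Send terminate to every NSA below the given one."""
    for child in nsa.children:
        log.append(("terminate", nsa.name, child.name))
        terminate_down(child, log)


def on_data_plane_failure(nsa, log):
    """Called at the NSA that detects the failure. Returns the initial
    requester, which can then take a policy-correct action."""
    terminate_down(nsa, log)
    node = nsa
    while node.parent is not None:
        log.append(("forcedEnd", node.name, node.parent.name))
        node = node.parent
    return node
```
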
>>
>>
>> == Recommendation / Action items ==
>>
>> * Make the exact semantics of the operation received ack clear
>>
>> Recommendation:
>> - The message has been received (duh)
>> - The request is sane
>> - The request has been serialized (crash safe).
>> - The specified connection exists (for provision, release, terminate)
>> - The request was authorized
>>
>>   This has the following implications:
>>   - Once the operation received ack has been received by the requester,
>>     the connection should show up in a query result. If we cannot expect
>>     the connection to show up after receipt of the ack, the primitive
>>     should be removed, as it has no semantic value.
>>   - Failing early will save message exchanges and time.
>>
>> * Make it clear which of the NSAs has the responsibility for what
>>
>> Recommendation:
>> - The provider is the authority for connection status (duh)
>> - Keeping connection state synchronized is the responsibility of the
>>   requester
>>
>>   This has the following implications:
>>   - Any (non-scheduled) connection state change must only be done at the
>>     initiative of the requester
>>   - The requester query interface is not needed.
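
The recommended semantics for the operation received ack amount to a set of checks the provider passes before acknowledging. A rough sketch, with several assumptions: requests are plain dicts, `reserve` is taken to be the operation that creates a connection, `authorized` is a caller-supplied callback, and an in-memory `journal` list stands in for crash-safe storage. None of these names come from the NSI specification.

```python
OPERATIONS = {"reserve", "provision", "release", "terminate"}
NEEDS_EXISTING = {"provision", "release", "terminate"}


def check_before_ack(request, connections, authorized, journal):
    """Return (ok, reason). Only send the operation received ack if ok."""
    op = request.get("operation")
    conn_id = request.get("connection_id")
    # The request is sane.
    if op not in OPERATIONS or not conn_id:
        return False, "request is not sane"
    # The specified connection exists (provision, release, terminate).
    if op in NEEDS_EXISTING and conn_id not in connections:
        return False, "connection does not exist"
    # The request was authorized.
    if not authorized(request):
        return False, "not authorized"
    # The request has been serialized. A real provider would write to
    # durable storage here; the list is only a placeholder.
    journal.append(dict(request))
    if op == "reserve":
        connections.add(conn_id)  # now visible to a query
    return True, "ok"
```

Serializing last means only accepted requests are journaled, and rejecting early matches the "failing early" point above.
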
>>
>> -- 
>>
>> -- 
>> Inder Monga
>> 510-486-6531
>> http://www.es.net <http://www.es.net/>
>> Follow us on Twitter: ESnetUpdates/Twitter <http://bit.ly/bisCAd>
>> Visit our blog: ESnetUpdates Blog 
>> <http://bit.ly/9lSTO3><http://bit.ly/d2Olql>
>>
>> _______________________________________________
>> nsi-wg mailing list
>> nsi-wg at ogf.org <mailto:nsi-wg at ogf.org>
>> https://www.ogf.org/mailman/listinfo/nsi-wg
>

Inder Monga
imonga at es.net
