[Nsi-wg] NSI error handling draft

Tue Apr 27 17:13:56 CDT 2010

On Apr 26, 2010, at 10:34 AM, John Vollbrecht wrote:

> 
> On Apr 22, 2010, at 7:11 PM, Inder Monga wrote:
> 
>> John,
>> 
>> If I may add my $0.02.
>> 
>> On Apr 22, 2010, at 2:03 PM, John Vollbrecht wrote:
>> 
>>> This is very nice.
>>> 
>>> A couple of comments/suggestions/questions:
>>> 
>>> 1) I think all the actions you suggest for the transport plane failure  
>>> are actually taken in the NRM or Service plane.  I may be wrong, but  
>>> that is what it seems to me.  If so, then I think it would be helpful  
>>> to describe the transport device/plane signalling failure to the NRM  
>>> at different times.  Something like this would have made it easier for  
>>> me to follow.
>> 
>> I agree that transport plane failure actions are handled either by the Service Plane, aka NSA (for example, "reserve alternative local resources"), or in the transport plane (for example, switch to backup). I do not understand what you mean by "describe the transport device/plane signalling failure to the NRM". Can you please elaborate?
> 
> I think there should be a statement something like "Transport plane failures are communicated to the NRM. The NRM deals with these based on the state of the NRM at the time it learns of the failure". The idea is to make it clear that this section is about how to deal with transport failures reported to the service plane. One might use NSA instead of NRM; I am not sure which would be more appropriate. I may not have explained this well; please ask questions if it is not clear.

John, good point. The assumption is that transport plane failures are somehow communicated up to the resource manager (NRM) and the reservation manager (NRM/NSA). The mechanism by which that happens is out of scope for the architecture document.
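
To make "the NRM deals with these based on its state" concrete, here is a minimal Python sketch. The state names, the action names, and the two availability flags are assumptions for illustration only; none of them come from the draft. The two locally handled actions mirror the examples quoted above ("reserve alternative local resources" and switch to backup).

```python
from enum import Enum, auto


class ConnState(Enum):
    # Hypothetical connection states held by the NRM/NSA.
    RESERVED = auto()
    ACTIVE = auto()
    RELEASED = auto()


class Action(Enum):
    RESERVE_ALTERNATIVE = auto()  # handled in the service plane, RA not notified
    SWITCH_TO_BACKUP = auto()     # handled in the transport plane, RA not notified
    NOTIFY_RA = auto()            # failure cannot be hidden, tell the RA
    IGNORE = auto()               # nothing is affected


def on_transport_failure(state, alternative_available, backup_available):
    """Pick a reaction to a transport failure report based on current state."""
    if state is ConnState.RELEASED:
        return Action.IGNORE
    if state is ConnState.RESERVED:
        # Not yet in service: try to re-plan locally before bothering the RA.
        return Action.RESERVE_ALTERNATIVE if alternative_available else Action.NOTIFY_RA
    # ACTIVE: traffic is flowing, so protection switching is the first choice.
    return Action.SWITCH_TO_BACKUP if backup_available else Action.NOTIFY_RA
```

The point of the sketch is only that the same failure report leads to different outcomes depending on when it arrives, and that notification to the RA is the fallback rather than the default.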

>> 
>> The intention of this section was to indicate the error cases which would result in notification to the RA and possible cancelation of a connection. There are cases highlighted where the errors are handled completely by the Service Plane or the Transport plane with no need for notification to the user/RA. 
>> 
> I note that the RA and PA are both in the service plane.  Presumably when an NSA with RA receives a fail message from the PA, the Segment/aggregate section of NSA also has state, and how the NSA deals with the message depends on the state of the NSA.

Agreed. The aggregation of confirms, the communication of failures to child RA/PA pairs, etc. all happen due to state kept in a particular NSA. What we have to discuss is how much of that state is stored in recoverable, non-volatile storage, i.e. if the NSA software/computer/agent crashes and recovers, does it recover all state or not? What if something was in flight? These are the cases the state machines and the protocol must be resilient to, i.e. they must recover to a stable state.
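
One possible shape for this, as a minimal Python sketch: connection state is written to a file atomically, and on restart anything found in a transient ("in flight") state is forced to a stable one. The state names, the JSON file layout, and the choice to fail in-flight operations rather than resume them are all assumptions for illustration; the draft does not decide any of these.

```python
import json
import os

# Hypothetical set of states considered stable after a restart.
STABLE_STATES = {"RESERVED", "PROVISIONED", "RELEASED", "FAILED"}


class StateStore:
    """Keeps a {connection_id: state} map in non-volatile storage."""

    def __init__(self, path):
        self.path = path

    def save(self, states):
        # Write to a temporary file, then rename, so a crash never leaves
        # a half-written state file behind.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(states, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.path)

    def load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)


def recover(store):
    """After a restart, move every in-flight connection to a stable state."""
    recovered = {
        conn_id: (state if state in STABLE_STATES else "FAILED")
        for conn_id, state in store.load().items()
    }
    store.save(recovered)
    return recovered
```

An NSA that keeps nothing in non-volatile storage is the degenerate case where `load()` always returns an empty map, which is exactly the situation in which peers have to carry the whole burden of cleanup.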

> 
>>> 
>>> 2) I don't understand the local and remote distinction in the Service  
>>> Plane failure discussion.  Perhaps local meaning NRM and remote  
>>> meaning reachable through NSI?
>> 
>> Local implies failure of own domain's RA or PA. Remote means failure of the remote RA or PA. The two cases are diagrammatically the same - the difference is in the context.
> This is still confusing to me.  If the session between the RA and PA fails, then isn't everything a local failure, whichever side you are on?  If a PA tries to send a message and it doesn't make it, how does it know whether the message got there or not?  If it is a timeout that reveals the session failure, both sides again seem equivalent.

Well, yes, I sort of agree. The distinction is not that clear. The local case is: if the local RA/PA crashes and then comes back up, how does it recover state, interact with neighboring RA/PA pairs, and deal with missed provision times or state inconsistencies that might have occurred? The remote case is: if my peer RA/PA crashes or becomes unavailable, what mechanisms/state transitions does that trigger on my side to clean up and arrive at a stable state, regardless of the time it takes for the peer to recover from its failure?
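
The two cases can be sketched side by side. This is a minimal Python illustration only: the `Reservation` record, the state names, and both function names are hypothetical and not part of the draft.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    conn_id: str
    state: str         # hypothetical state name, e.g. "RESERVED"
    start_time: float  # scheduled provision time, seconds since the epoch


def after_local_restart(reservations, now):
    """Local case: list reservations whose provision time passed while down."""
    return [
        r.conn_id
        for r in reservations
        if r.state == "RESERVED" and r.start_time <= now
    ]


def after_peer_loss(reservations, peer_conn_ids):
    """Remote case: settle everything that depends on the lost peer.

    Affected connections go to a stable state immediately, without
    waiting for the peer to come back.
    """
    return {
        r.conn_id: ("FAILED" if r.conn_id in peer_conn_ids else r.state)
        for r in reservations
    }
```

In the local case the agent is catching up on its own history; in the remote case it acts on what it can observe about its peer. The data involved is the same, which is why the two cases look alike in a diagram.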

>> 
>>> 
>>> 3) I am wondering how service plane failures are discovered?  Is it some  
>>> sort of session failure?
>> 
>> There are a couple of assumptions here:
>> 1. There is reliable messaging between RA and PA
>> 2. There is a timeout if responses are not received from the RA/PA (could be after multiple tries). This timeout could be due to a management network failure between RA and PA. 
> 
> These seem like they could have different consequences.

Absolutely - the effects of the two are different. But from a peer NSA's perspective, it should not matter whether the cause of the failure is 1) or 2) - the NSA should have the mechanisms to recover to a consistent state.
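
A minimal Python sketch of that idea: a request is retried a fixed number of times, and once the tries are exhausted the caller moves to a stable state without inspecting why the peer did not answer. The function names, the `send` callable, the default timeout and retry count, and the "RESERVED"/"FAILED" state names are assumptions for illustration, not part of the draft.

```python
from typing import Callable, Optional


class PeerUnreachable(Exception):
    """Raised when the peer did not answer within the allowed tries."""


def request_with_retries(
    send: Callable[[float], Optional[str]],
    timeout_s: float = 5.0,
    max_tries: int = 3,
) -> str:
    # `send` is a hypothetical transport hook: it returns the reply, returns
    # None if nothing arrived in time, or raises on a network-level error.
    for _ in range(max_tries):
        try:
            reply = send(timeout_s)
        except (TimeoutError, ConnectionError):
            reply = None  # the cause is deliberately not distinguished
        if reply is not None:
            return reply
    raise PeerUnreachable(f"no reply after {max_tries} tries")


def reserve(send: Callable[[float], Optional[str]]) -> str:
    """Return the stable state reached after attempting a reservation."""
    try:
        request_with_retries(send)
        return "RESERVED"
    except PeerUnreachable:
        return "FAILED"
```

Whether the peer crashed or the management network between RA and PA went down, the caller ends in the same state; only the later recovery path may differ.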

Hope this helps,
Inder

> 
> I agree that the service plane failures are less well defined so far.  It is good you are thinking about them and starting discussion.  
> The issue of whether NSAs (or only NRMs) keep state after a connection is reserved is another issue that impacts this.
> 
> John
> 
> 
> 
>> Hope this helps - thanks for your feedback.
>> 
>> Inder
>> 
>> 
>>> 
>>> John
>>> 
>>> On Apr 21, 2010, at 1:49 AM, Chin Guok wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> I've attached a draft of the error handling section that Inder and I  
>>>> came up with for the NSI Architecture document.
>>>> 
>>>> This is a rough first draft, and there are some obvious portions  
>>>> missing, but it gives an idea of where we are heading.
>>>> 
>>>> Comments are most welcome.
>>>> 
>>>> Thanks.
>>>> 
>>>> - Chin
>>>> 
>>>> <NSI Error Handling Chin_Inder v2.docx>
>>>> _______________________________________________
>>>> nsi-wg mailing list
>>>> nsi-wg at ogf.org
>>>> http://www.ogf.org/mailman/listinfo/nsi-wg
>>> 
>> 
>> ---
>> Inder Monga				http://100gbs.lbl.gov
>> imonga at es.net			http://www.es.net
>> (510) 499 8065 (c)		
>> (510) 486 6531 (o)		
>> 
> 

---
Inder Monga				http://100gbs.lbl.gov
imonga at es.net			http://www.es.net
(510) 499 8065 (c)		
(510) 486 6531 (o)		
