[Nsi-wg] NSI conf call minutes

Thu Nov 11 12:35:12 CST 2010

Peoples,

I had an off-line discussion with Jerry yesterday where we discussed the value of Notifications and the Query operation.  I thought I would summarize some of the key points, and provide a couple of error recovery use cases that will need to be addressed within the protocol solution.  I have not yet read the minutes so some of this might be covered.

Considering yesterday’s call I think we need to distinguish base protocol primitives and operation types supported for each service:

1. There is a base protocol primitive supporting a synchronous request/response message interaction.
2. There is a base protocol primitive supporting an asynchronous message delivery.

Obviously, base protocol primitive #1 could be implemented using #2, but I think distinguishing the two is important.  With these two base primitives we implement the service operations:

1. The request/response synchronous interaction, used by operations that can be completed “quickly”, uses base protocol primitive #1.
2. The request/acknowledgment synchronous interaction uses base protocol primitive #1, but the corresponding response uses base protocol primitive #2.
3. Spontaneously generated service events (whether expected or errors) will use base protocol primitive #2.
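To make the mapping above concrete, here is a minimal sketch of how the three operation types might sit on top of the two base primitives. All names (Provider, Message, query, reserve, raise_event, the "kind" strings) are hypothetical and for illustration only; nothing here is an agreed NSI interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Message:
    correlation_id: str
    kind: str  # hypothetical labels, e.g. "ack", "reserve_confirmed", "event"
    body: dict = field(default_factory=dict)


class Provider:
    """Hypothetical provider NSA exposing both base primitives."""

    def __init__(self, deliver_async: Callable[[Message], None]) -> None:
        # Primitive #2: one-way asynchronous delivery towards the requester.
        self._deliver_async = deliver_async
        self._pending: List[str] = []

    # Operation type 1: synchronous request/response (primitive #1 only).
    def query(self, correlation_id: str) -> Message:
        return Message(correlation_id, "query_result", {"state": "reserved"})

    # Operation type 2: synchronous request/acknowledgment (primitive #1);
    # the real response follows later over primitive #2.
    def reserve(self, correlation_id: str) -> Message:
        self._pending.append(correlation_id)
        return Message(correlation_id, "ack")

    def complete_pending(self) -> None:
        # Stands in for the slow work finishing some time after the ack.
        while self._pending:
            cid = self._pending.pop(0)
            self._deliver_async(Message(cid, "reserve_confirmed"))

    # Operation type 3: spontaneously generated event (primitive #2 only).
    def raise_event(self, correlation_id: str, reason: str) -> None:
        self._deliver_async(Message(correlation_id, "event", {"reason": reason}))


inbox: List[Message] = []          # the requester's asynchronous inbox
pa = Provider(deliver_async=inbox.append)

result = pa.query("q-1")           # answered immediately
ack = pa.reserve("r-1")            # only acknowledged immediately
pa.complete_pending()              # confirmation arrives asynchronously
pa.raise_event("r-1", "forced_end")
```

The correlation identifier is what lets the requester match an asynchronous response to the request that was acknowledged earlier.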

We also need to decide if #3 is defined using specific asynchronous error messages defined within the protocol, or as a generic notification mechanism over which a well defined set of error messages can be delivered.

Considering the NSI discussion yesterday, I think we might have agreed that a generic notification mechanism supporting some type of flexible filter-based registration would not be specifically required for the connection service specification, as we want to keep it tight and well defined.  Nothing is stopping this type of filter-based notification from being defined in support of other services.

Now onto the topic of error recovery, or more specifically, how NSA would realign after either a peer NSA reboot or a DCN failure between NSA.  This topic came out of the discussion on the usefulness of the Query() and Notify() operations, each of which is used in patterns to achieve the end goal.

Assumptions:
We define a simple NSA deployment where there is one Requester NSA (RA1) and two Provider NSA (PA1 and PA2).

RA1 maintains a list of all service reservations it has made on behalf of the end application.  We will refer to these as reservations owned by RA1.

RA1 will notify applications of service reservation state changes and errors for any services the RA made on behalf of the application.

Notes: From an NSI protocol perspective an NSA restart or a DCN failure to an NSA is functionally equivalent for the purposes of error recovery and realignment.  However, internal to an NSA this is not the case, as after a reboot local data recovery and realignment with actual transport network state may need to occur.

Task: Recover current state for existing reservations
RA1 has just established communications with PA1 and PA2 after an outage.  RA1 has a contractual obligation to notify associated applications of state changes on their reservations.  During the outage a change in reservation status may have occurred on dependent reservations within PA1 and PA2 that are part of RA1's owned reservations.  RA1 must somehow be able to reconcile the overall status of each owned reservation, and notify the applications of any change from last known state.

Option #1: Do nothing and let future events refresh the state
Not really an option, as there can be long periods during which no events occur while the RA1 service state is wrong.  In addition, other events such as a local PA reservation deletion (administrative) would go unnoticed.

Option #2: Persistent events
PA1 and PA2 would queue all events to RA1 indefinitely until connectivity has been restored and the events can be successfully delivered.  This would mean PA1 and PA2 would need to persistently store the events until delivered to protect against a local restart.  RA1 would trust the PAs to always deliver events that would allow it to update state.

This model is supported by many Java Message Service (JMS) implementations, but is typically used by applications to poll events.  This model does allow for a simplified application since auditing of missed events (distributed state) is not required.  Unfortunately, it puts more resource stress on the server that must maintain events for longer durations and guarantee there is no loss of events under any conditions.
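As a rough illustration of Option #2, the sketch below shows a PA-side store that keeps events until the peer has accepted them. It uses SQLite purely as a stand-in for whatever persistent store a PA would really use; the class name, table layout, and the send callback are all assumptions made for this example.

```python
import sqlite3
from typing import Callable, List, Tuple


class PersistentEventQueue:
    """Hypothetical PA-side queue that holds events until they are delivered."""

    def __init__(self, db_path: str = ":memory:") -> None:
        # A file path instead of ":memory:" would survive a local restart.
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
            "reservation_id TEXT NOT NULL, "
            "state TEXT NOT NULL)"
        )
        self._db.commit()

    def enqueue(self, reservation_id: str, state: str) -> None:
        # Store the event before any delivery attempt is made.
        self._db.execute(
            "INSERT INTO events (reservation_id, state) VALUES (?, ?)",
            (reservation_id, state),
        )
        self._db.commit()

    def flush(self, send: Callable[[str, str], bool]) -> int:
        """Deliver queued events in order, stopping at the first failure.

        An event is deleted only after send() reports success, so this gives
        at-least-once delivery. Returns the number of events delivered.
        """
        delivered = 0
        rows = self._db.execute(
            "SELECT seq, reservation_id, state FROM events ORDER BY seq"
        ).fetchall()
        for seq, reservation_id, state in rows:
            if not send(reservation_id, state):
                break
            self._db.execute("DELETE FROM events WHERE seq = ?", (seq,))
            self._db.commit()
            delivered += 1
        return delivered

    def pending(self) -> int:
        return self._db.execute("SELECT COUNT(*) FROM events").fetchone()[0]


queue = PersistentEventQueue()
queue.enqueue("res-1", "provisioned")
queue.enqueue("res-1", "released")

received: List[Tuple[str, str]] = []


def link_down(reservation_id: str, state: str) -> bool:
    return False  # simulates the DCN outage towards RA1


def link_up(reservation_id: str, state: str) -> bool:
    received.append((reservation_id, state))
    return True


delivered_while_down = queue.flush(link_down)
delivered_after_recovery = queue.flush(link_up)
```

The resource cost mentioned above shows up here as the events table, which grows without bound for as long as the peer stays unreachable.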

Option #3: Audit component reservations using a query function
RA1 would progress through each of the owned reservations within its database and query each of PA1 and PA2 for the state of component reservations.  If during this audit period new events arrive that update the state of one of their associated schedules, conflict resolution may need to occur to determine if the query result or the event is newer.
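One way the conflict resolution could work is sketched below, assuming each PA attaches a per-reservation version counter to both events and query results. That counter is an assumption for illustration, not something the protocol defines today; a timestamp from the PA could serve the same purpose.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StateReport:
    reservation_id: str
    state: str
    version: int  # hypothetical: incremented by the PA on every state change


class ReservationTable:
    """Hypothetical RA-side view of component reservation state."""

    def __init__(self) -> None:
        self._latest: Dict[str, StateReport] = {}

    def apply(self, report: StateReport) -> bool:
        """Apply a report that came from either an event or a query result.

        Reports that are not newer than what is already held are ignored.
        Returns True when the stored state changed, i.e. when the owning
        application should be notified.
        """
        current = self._latest.get(report.reservation_id)
        if current is not None and report.version <= current.version:
            return False
        self._latest[report.reservation_id] = report
        return current is None or current.state != report.state

    def state_of(self, reservation_id: str) -> str:
        return self._latest[reservation_id].state


table = ReservationTable()
table.apply(StateReport("res-1", "reserved", 1))

# An event arrives while the audit query is still in flight ...
changed_by_event = table.apply(StateReport("res-1", "provisioned", 3))
# ... and the older query result arrives afterwards and is discarded.
changed_by_stale_query = table.apply(StateReport("res-1", "reserved", 2))
```

Because both paths go through the same apply() call, it does not matter whether the event or the query result arrives first.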

I have used this pattern extensively and it works well.  Auditing of reservations can be staggered to avoid flooding neighboring NSA nodes and to control local resource utilization.  Bulk, wildcard, or constrained queries can also help to speed up the alignment process (e.g. query all reservations I own that have had a state change since time x).  In addition, queries are localized to only those PA involved in the RA’s owned services (which would be direct sibling NSA).
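A staggered audit using a constrained "changed since time x" query might look roughly like the following. The query signature, the pause between providers, and the fake provider data are all made up for the example.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical query signature:
#   (owner_id, changed_since) -> [(reservation_id, state), ...]
QueryFn = Callable[[str, float], List[Tuple[str, str]]]


def audit_providers(
    owner_id: str,
    providers: Dict[str, QueryFn],
    changed_since: float,
    pause_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Query each provider in turn for owned reservations changed since a time.

    A pause between providers staggers the audit so neighbours are not
    flooded. Returns reservation_id -> latest reported state.
    """
    refreshed: Dict[str, str] = {}
    for index, query in enumerate(providers.values()):
        if index > 0 and pause_seconds > 0:
            sleep(pause_seconds)
        for reservation_id, state in query(owner_id, changed_since):
            refreshed[reservation_id] = state
    return refreshed


def make_fake_provider(records: List[Tuple[str, str, str, float]]) -> QueryFn:
    # records: (reservation_id, owner_id, state, last_changed)
    def query(owner_id: str, changed_since: float) -> List[Tuple[str, str]]:
        return [
            (rid, state)
            for rid, owner, state, changed in records
            if owner == owner_id and changed > changed_since
        ]

    return query


providers = {
    "PA1": make_fake_provider(
        [("res-1", "RA1", "released", 120.0), ("res-2", "RA1", "provisioned", 50.0)]
    ),
    "PA2": make_fake_provider(
        [("res-3", "RA1", "terminated", 130.0), ("res-4", "RA9", "reserved", 140.0)]
    ),
}

pauses: List[float] = []  # records the pauses instead of really sleeping
refreshed = audit_providers(
    "RA1", providers, changed_since=100.0, pause_seconds=0.5, sleep=pauses.append
)
```

Here res-2 is skipped because it has not changed since the cut-off, and res-4 is skipped because it is owned by a different RA.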

Other error recovery tasks we need to address related to failure:
I believe the same alignment task will need to be performed on Provider NSA for component reservations after recovery of a PA-to-PA link.

How does a Provider NSA recover from a failure during path reservation?  For example, RA1 sends a reservation to PA1, which sends it to PA3, which reserves resources and sends back a confirmation.  PA1 receives the confirmation but fails before committing locally and returning the result to RA1.  RA1 times out the operation after getting no reply from PA1 and decides to reserve an alternative route.  PA1 recovers: does it have a record of the pending reservation request?  If so, does it return the committed resource to RA1 to have RA1 send a cancel?  Does PA1 automatically cancel all reservations down the chain/subtree (i.e. PA3)?

How do we handle a failure during a cancel operation?

We definitely need an error handling section for connection services!

John.

John MacAuley
OpenDRAC Architect
SURFnet bv.

----- Original Message -----
> From: "Guy Roberts" <Guy.Roberts at dante.net>
> To: "NSI WG" <nsi-wg at ogf.org>
> Sent: Thursday, November 11, 2010 7:29:37 AM
> Subject: [Nsi-wg] NSI conf call minutes
> … are available here:
> 
> http://forge.gridforum.org/sf/go/doc16155?nav=1
> 
> Guy
> 
> _____________________________________________________________________
> 
> Guy Roberts, PhD
> Network Engineering & Planning
> DANTE
> City House, 126-130 Hills Road
> Cambridge CB2 1PQ
> United Kingdom
> Tel: +44 (0)1223 371300
> Direct: +44 (0)1223 371316
> Fax: +44 (0)1223 371371
> E-mail: guy.roberts at dante.net
> WWW: http://www.dante.net
> 
> _____________________________________________________________________