[Nsi-wg] NSI error handling draft - next version

Wed Apr 28 13:38:43 CDT 2010

John,

Great points about administrative and maintenance procedures. 

We would have to assume that the NSA/NRM receives an event carrying the right "notification" of the reason for the topology change, through the OSS/network management platform. Otherwise, we will not be able to differentiate between the causes of the topology change, nor estimate the duration of the change as we could in the case of maintenance. We can assume the default case to be #1 if the NSA/NRM is not notified of the exact cause.
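
To make the default rule concrete, here is a minimal Python sketch of how an NSA/NRM might map a notified cause onto John's three cases. The cause strings and the `classify` helper are hypothetical, made up for illustration; nothing here is a defined NSI or OSS event format.

```python
from enum import Enum
from typing import Optional


class Cause(Enum):
    FAILURE = 1            # case #1: unplanned failure, restoration time unknown
    PERMANENT_REMOVAL = 2  # case #2: administrator removes the link for good
    MAINTENANCE = 3        # case #3: planned outage with a defined window


# Hypothetical cause strings an OSS event might carry (an assumption).
KNOWN_CAUSES = {
    "failure": Cause.FAILURE,
    "permanent-removal": Cause.PERMANENT_REMOVAL,
    "maintenance": Cause.MAINTENANCE,
}


def classify(notified_cause: Optional[str]) -> Cause:
    """Return the case for an event, defaulting to #1 when the cause is unknown."""
    if notified_cause is None:
        return Cause.FAILURE
    return KNOWN_CAUSES.get(notified_cause.strip().lower(), Cause.FAILURE)
```

With this sketch, an event with no cause, or with a cause the NSA/NRM does not recognize, is handled as a failure with no known restoration time.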

Thanks,
inder

On Apr 28, 2010, at 8:50 AM, John MacAuley wrote:

> Peoples,
> 
> Had someone show up in my office so I missed the conversation over "Resource change from available to not available."  I thought I would provide some input on the topic based on my DRAC experiences.
> 
> I think there are three types of events that can initiate a topology change that should be understood when defining the error handling.  Two of these are actually not errors but normal operating procedures within a network:
> 
> 1. Physical network failure resulting in a topology change - typically the temporary removal of a link from topology with no knowledge of when it will be restored.
> 
> 2. The permanent removal of a link from the topology by a network administrator.  Actually, this one should include the reconfiguration of the network where an entire node could be removed.
> 
> 3. The temporary removal of a link by a network administrator for maintenance purposes.  This will typically have a defined start and end time based on the maintenance window.
> 
> #1 is interesting in that it impacts existing schedules in an in-service state, reserved schedules not yet in service, and any new reservation requests.
> 
> a) Those schedules in-service using the links impacted by the topology change may undergo some type of restoration.  If this was a protected circuit then the underlying transport will restore the service and we may not want to do anything about it.  If this was an unprotected service then perhaps a re-dial could be initiated by the NRM in an attempt to achieve a lazy restore.
> 
> b) Depending on the estimated length of the temporary topology change, we may need to recompute the paths of those schedules reserved but not yet provisioned.  We should not recompute the paths from the point of failure to the end of time, but only for some predefined floating window that is optimistic enough to give the failure time to recover and that reduces the number of schedules to be recomputed.  For example, a floating one hour window would mean all reservations up to an hour in the future that could be impacted by the failure can be recomputed.  If the failure is cleared and the topology is restored, then there is a one hour window on that link that has already been cleared of reservations.  The interesting side-effect is we now have a window of time to make sure the link remains trouble free.  The question is whether we have blocked that link from use, or whether a new schedule can use the remaining hour if it comes in after the trouble has cleared.
> 
> c) If a new reservation request for a future point in time arrives while a failure has taken the link out of topology, do we remove the link from computation, or do we add an optimistic guard time after which we can assume the link will be restored?
> 
> #2 is different from a fault condition in that an administrator has removed the link from topology.  We can model this gracefully if we can have a high priority (preemptive) administration reservation that can block the bandwidth on a link from the point in time the link will be removed through until infinity.  Any schedules this preemptive schedule impacts will need to be recomputed as discussed in the previous example or, if already provisioned, switched to protection or re-dialed to restore.  At some point on or after the start of the preemptive schedule the link can be permanently removed from topology and the reservation blocking that link cleared.
> 
> #3 is similar to #2 except there is a defined end time for the preemptive schedule blocking the link.  Only reservations overlapping with the maintenance window would need to be recomputed.  Obviously, any provisioned schedules would need to be switched to protection or re-dialed to restore.
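
The preemptive reservation for #2 and #3 could be sketched as below. This is only an illustration under assumptions: `AdminBlock`, `Schedule`, and `impacted` are invented names, and times are plain numbers (say, hours) to keep the overlap test readable.

```python
import math
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Schedule:
    name: str
    start: float
    end: float
    links: Set[str]


@dataclass
class AdminBlock:
    """High priority administrative reservation that blocks one link."""
    link: str
    start: float
    end: float = math.inf  # no end time: permanent removal (#2)

    def overlaps(self, s: Schedule) -> bool:
        # Half-open interval overlap, restricted to schedules using the link.
        return (self.link in s.links
                and s.start < self.end
                and s.end > self.start)


def impacted(block: AdminBlock, schedules: List[Schedule]) -> List[str]:
    """Names of schedules to recompute (or to switch/re-dial if provisioned)."""
    return [s.name for s in schedules if block.overlaps(s)]
```

A maintenance window (#3) is the same object with a finite `end`, so only schedules overlapping the window are returned, while the permanent case (#2) also catches everything that starts after the removal.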
> 
> John.
> 
> On 10-04-28 2:14 AM, Inder Monga wrote:
>> 
>> Hi All, 
>> 
>> Attached is an updated draft based on the comments. We added a table at the front to summarize the cases and to use in discussions. Looking forward to discussing this tomorrow.
>> 
>> Thanks, 
>> Inder 
>> 
>> 
>> 
>> On Apr 20, 2010, at 10:49 PM, Chin Guok wrote: 
>> 
>>> Hi all, 
>>> 
>>> I've attached a draft of the error handling section that Inder and I came up with for the NSI Architecture document. 
>>> 
>>> This is a rough first draft, and there are some obvious portions missing, but it gives an idea of where we are heading.
>>> 
>>> Comments are most welcome.
>>> 
>>> Thanks. 
>>> 
>>> - Chin
>>> 
>>> <NSI Error Handling Chin_Inder v2.docx>
>>> _______________________________________________
>>> nsi-wg mailing list 
>>> nsi-wg at ogf.org 
>>> http://www.ogf.org/mailman/listinfo/nsi-wg 
>> 
>> 
> 

---
Inder Monga				http://100gbs.lbl.gov
imonga at es.net			http://www.es.net
(510) 499 8065 (c)		
(510) 486 6531 (o)		
