[Nsi-wg] Comments on NSI architecture document.
John MacAuley
john.macauley at surfnet.nl
Wed Jun 16 17:07:26 CDT 2010
Peoples,
I spent some time going through the specification in detail last week,
but unfortunately, I used an older version of the document so last night
I consolidated my comments against Guy’s version I pulled down on
Monday. I colour coded for your reading pleasure. Red implies original
text from the document. I have attached a word version of the comments
if the formatting gets lost in transit.
John.
_Section 3.4 NSI Service Definitions_
“A service request is fully specified when all parameters associated
with that service have been determined either by explicit user
specification or by implicit default values found in the Service
Definition.”
Are service definition defaults a common global configuration or are
these defaults a localized decision? If they are a localized decision
then the requestor NSA should “fill in the blanks” so that all
subsequent provider NSAs contacted have the assumed default values filled
in the service request.
And similarly,
_Section 5.1.2 Service Definitions for Connection Services_
“If a service parameter is not present in the service request, then the
provider NSA should “fill in the blanks” from default values in the
Service Definition. As the request is processed down the NSA service
tree, default values adopted in one transit network may implicitly
constrain the request in downstream networks. Therefore, in general,
each NSA should use default values that provide the greatest leeway to
the pathfinder in satisfying the request both within the local network
and in external downstream networks.”
This mechanism is rather complex as described. If service parameters are
left open-ended by some NSA, then an additional visit to that NSA must
be performed to finalize the actual negotiated parameters. In the tree
model this would require a second pass to commit the final service
definition negotiated across the network. In the chain model it would
require the end terminating NSA in the chain to finalize the service
definition and then every node returning up the chain would finalize
their definition.
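To make the concern concrete, here is a minimal sketch of the alternative I am suggesting: the requestor NSA completes the request from its Service Definition defaults up front, so every provider NSA down the tree or chain sees the same assumed values and no second finalization pass is needed. All names here (the parameter keys, SERVICE_DEFINITION_DEFAULTS, fill_defaults) are my own illustration, not from the document.

```python
# Hypothetical "fill in the blanks" step performed by the requestor NSA
# before forwarding a service request. Parameter names are invented.

SERVICE_DEFINITION_DEFAULTS = {
    "bandwidth_mbps": 100,   # assumed default from the Service Definition
    "mtu": 1500,             # assumed default
    "protection": "none",    # assumed default
}

def fill_defaults(request: dict) -> dict:
    """Return a fully specified request: explicit user values win,
    missing parameters are taken from the Service Definition defaults."""
    filled = dict(SERVICE_DEFINITION_DEFAULTS)
    filled.update({k: v for k, v in request.items() if v is not None})
    return filled

# A partially specified user request: bandwidth is explicit, the rest
# is filled in before any provider NSA is contacted.
user_request = {"bandwidth_mbps": 500, "mtu": None}
print(fill_defaults(user_request))
```

With this approach no provider NSA ever sees an open-ended parameter, so the second pass described above disappears.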
_Section 3.6 Trust and authentication in NSI_
The term “service handler” is used for the first time. Should this read,
“message handler” as defined in figure 5?
“The second mode is to employ a more message based trust framework such
as Web Services. This message based form is more appropriate for
occasional messaging as might occur between an application agent and
various provider NSAs.”
I believe this last statement is subjective and should be removed from
the document. I am currently working on a production product that is
processing over 400 SOAP messages a second with per message
authentication while performing other more computationally heavy tasks.
I think this exceeds the expectation of “occasional” :-)
_Section 3.7 Error handling NSI_
The term “Network errors” is ambiguous based on the topic at hand. Could
we better qualify this to “Service plane (NSI protocol infrastructure)
errors” to distinguish this from transport plane network errors? I was
going to complain about our confusing use of the term “service” but have
no valid alternative :-)
“A failure in the Service Plane should not result in an incomplete
service.” should be restated as a goal rather than a guarantee. If we
have started to provision an end-to-end transport connection and partway
through we have a provider NSA or DCN failure, we have no way of knowing
whether the sub-network connection was established, and therefore we are
in an incomplete state from which we cannot recover.
“For example, a user may request that if any NSA fails, all the NSAs
handling the same service instance should tear down the Connection
Service in the Transport Plane.” This is a very interesting error
handling scenario. I had assumed that only the requesting NSA would need
to listen for failure notifications from individual provider NSAs against
the services it instantiated, but with this example we imply that each
NSA will listen for events from other NSAs on services for which it is
only a tandem, so that it might tear down the tandem segment during
failures (a chain would only require adjacent NSAs to be monitored, as
failures would be cascaded). Do we really want this additional complexity
to protect against a double failure? I would recommend we keep it simple
and have the requesting NSA decide when to tear down the transport
resources through a cancel request; otherwise, all connection resources
get cleaned up at end-time.
“Failures in the Service Plane during Reservation, Provisioning,
Teardown, and Release phases can cause problems for the operation of the
NSI.” Do we want to normalize these phases against the states described
in Figure 15? Specifically, the phase “teardown” is not stated in Figure
15. In fact, is “teardown” not redundant with “releasing”?
“Figure 11: Local/Remote Failures” was a bit confusing for me. Does the
rounded square represent a local NSA?
Should we expand this section, or add an appendix covering error use cases?
_Section 3.8 Transport failure awareness_
Detection of transport errors should be a local issue, but the NSI
protocol needs to specify a mechanism to notify other NSAs of a local
transport failure against a connection. The correlation of a local
transport error to impacted connections is a local matter.
Once again, should we expand this section, or add an appendix covering
error use cases?
_Section 5.1.3 The Connection Service States_
“In the NSI, a connection goes through five phases: Reserving,
Scheduled, Provisioning, In-Service, Releasing.” I think we could
benefit from having a high level state machine in the document to
capture additional information implied in the text. As I was trying to
correlate the phases to the operations as defined in Figures 15 and 16,
as well as include error handling, I formed the opinion that we need
some additional state information beyond what we have in Figure 15 to
show the life cycle of a connection.
“When the Release has completed, the connection object is deleted from
the Service Plane.” Given my previous statement I believe we do not want
to delete the connection object after resources have been freed, but
introduce additional end states that allow the object to exist after the
scheduled end time. At the moment, if a cancelRequest is processed to
completion the connection object would end up being deleted as soon as
the transport resources have been released. Now I can no longer see the
state of this connection object, and therefore, cannot determine any
state information about the connection after the fact. If the
originating user did not issue the cancel request, they would have no
way to query their connection to see what happened.
I think an easy solution to this problem is to have a set of end states
for a connection object and place a hold-over timer on the connection
object that would eventually remove it from the NSA, but only after a
period of time (say 24 hours). We also need to clearly indicate if the
connection was terminated due to error, a cancel request, or if the
scheduled end-time occurred.
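A rough sketch of what I have in mind: terminal states that record why the connection ended, plus a hold-over timer before the object is purged. The state names, fields, and 24-hour value are my own assumptions for illustration, not anything defined in the document.

```python
# Illustrative connection object with proposed terminal states and a
# hold-over timer. All names and values here are assumptions.

import time
from enum import Enum

class ConnState(Enum):
    IN_SERVICE = "in-service"
    RELEASING = "releasing"
    # proposed terminal states recording *why* the connection ended
    ENDED_BY_CANCEL = "ended-by-cancel"
    ENDED_BY_SCHEDULE = "ended-by-schedule"
    ENDED_BY_ERROR = "ended-by-error"

HOLD_OVER_SECONDS = 24 * 3600  # e.g. keep terminal objects for 24 hours

class Connection:
    def __init__(self, conn_id: str):
        self.conn_id = conn_id
        self.state = ConnState.IN_SERVICE
        self.ended_at = None

    def terminate(self, reason: ConnState):
        """Move to a terminal state instead of deleting the object."""
        self.state = reason
        self.ended_at = time.time()

    def purgeable(self, now: float) -> bool:
        """True only once the hold-over period after termination passes."""
        return (self.ended_at is not None
                and now - self.ended_at >= HOLD_OVER_SECONDS)

c = Connection("conn-42")
c.terminate(ConnState.ENDED_BY_CANCEL)
print(c.state.value)  # a query can still see how the connection ended
```

The originating user can then query the connection after the fact and see whether it ended by cancel, by schedule, or by error, until the timer removes it.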
_Section 5.1.4 Connection reservation messages_
“If the connection request includes a valid start-time and an end-time
then the request is considered to be an advance reservation request.”
Does this preclude me from specifying a “duration” instead of an
“end-time” for an advance reservation?
“If the connection request has the start-time set to ‘asap’ and has a
duration field rather than an end time field, the request is considered
to be an immediate reservation request.” Why preclude an end-time as a
possible field for the immediate reservation? We only need to trigger off
the “asap” to determine it is an immediate reservation.
Can we normalize the “de-provisioning” terminology to “releasing” as
stated in previous sections?
“When operating in explicit mode, it is the responsibility of the
requestor NSA to signal the reservation to begin provisioning and to
begin de-provisioning of the connection. These signals are known as the
ProvisionRequest and CancelRequest.” Based on the statement in section
5.1.5, paragraph 2, “The reservation end-time refers to the time at
which the reservation is removed. (If the user has not yet sent a
CancelRequest signal the connection is de-provisioned first)” can I
assume that “signaling of de-provisioning” is optional and both
automatic and explicit mode connections will be automatically torn down
when end-time occurs?
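To show how simple the classification becomes if we trigger only off “asap” and accept either an end-time or a duration in both cases, here is a sketch. The field names and the end-time = start-time + duration normalization are my assumptions, not text from the document.

```python
# Hypothetical reservation-request classification: "asap" start-time
# means immediate, a concrete start-time means advance; end-time and
# duration are interchangeable. Field names are invented.

def classify(request: dict) -> str:
    start = request.get("start_time")
    if start is None:
        raise ValueError("start-time is required")
    # Trigger only off "asap"; end-time vs duration plays no role.
    return "immediate" if start == "asap" else "advance"

def end_time(request: dict, now: float) -> float:
    """Normalize: accept an explicit end-time or start-time + duration."""
    if "end_time" in request:
        return request["end_time"]
    start = now if request["start_time"] == "asap" else request["start_time"]
    return start + request["duration"]

print(classify({"start_time": "asap", "duration": 3600}))    # immediate
print(classify({"start_time": 1000.0, "end_time": 5000.0}))  # advance
print(end_time({"start_time": 1000.0, "duration": 600}, now=0.0))  # 1600.0
```

With this, neither mode precludes any timing field, and there is no conflicting code logic to implement.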
_Section 5.1.5 Connection reservation and timing parameters_
I think we need to make the following two definitions consistent, as they
introduce conflicting code logic that really doesn’t need to exist.
1. “For advance reservation with /automatic/ provisioning, the
start-time refers to the time at which the connection moves from
provisioning state to in-service state.”
2. “For advance reservation with /explicit/ provisioning, the start-time
refers to the time at which the provider is able to accept a provision
signal.”
In #1 the NSA must start provisioning the local connection segment at
“start-time” – “guard-time” (this is my definition of guard-time and not
the one in the document) so that the connection can be “in-service” by
“start-time.” However, for #2 the “start-time” parameter represents the
point at which the requesting NSA can request provisioning of the
connection to start. In the case of #2, the actual “in-service” state is
achieved at “start-time” + “guard-time” and not “start-time” as in #1. I
think we should try to avoid this type of confusion in the document, as
it will also imply two separate definitions in an NSA implementation.
May I suggest that the “ProvisionRequest” operation change the state of
an “advance reservation with explicit provisioning” to an “automatic
provisioning” state. This would be beneficial for two reasons:
1. A Requestor NSA can send down a “ProvisionRequest” operation
before “start-time” without receiving an error for issuing the request
too early. The Provider NSA would then transition the explicit
reservation to an automatic state and start provisioning the connection
at “start-time” – “guard-time”.
2. If the Requestor NSA is made aware of the connection provisioning
“guard-time”, it can issue the “ProvisionRequest” operation at
“start-time” – “guard-time” to get the same behavior as the automatic
provisioning case.
In both cases the “ProvisionConfirmation” notification, or perhaps a new
ActivationConfirmation notification (if we want to keep the other one as
ack to the operation itself), would be sent back when the connection is
provisioned in the transport plane.
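The suggestion reduces to one timing rule for both modes, which I can sketch as follows; the reservation fields and function names are my own illustration, not NSI-defined.

```python
# Sketch of the suggested unification: a ProvisionRequest on an explicit
# reservation simply flips it to automatic, so both modes share the
# single rule "begin provisioning at start_time - guard_time" and the
# connection is in service by start_time. All names are assumptions.

def on_provision_request(reservation: dict) -> dict:
    """Accept a ProvisionRequest at any time before start-time by
    transitioning the reservation to automatic provisioning."""
    if reservation["mode"] == "explicit":
        reservation = dict(reservation, mode="automatic")
    return reservation

def provisioning_trigger_time(reservation: dict) -> float:
    """One timing definition for both provisioning modes."""
    return reservation["start_time"] - reservation["guard_time"]

res = {"mode": "explicit", "start_time": 10_000.0, "guard_time": 120.0}
res = on_provision_request(res)          # early request is not an error
print(res["mode"])                       # now automatic
print(provisioning_trigger_time(res))    # 9880.0
```

An NSA implementation then carries a single in-service timing definition instead of two.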
“The reservation end-time refers to the time at which the reservation is
removed. (If the user has not yet sent a CancelRequest signal the
connection is de-provisioned first).” Could we not also use “start-time
+ duration” to imply “end-time”?
““Infinite” can be used as an end time. In this case, resources are
reserved forever (i.e. until a release request is received or may be
overwritten by policy limits). ” If we get an infinite duration request
and there is an NSA policy specifying a maximum connection duration, why
would we not reject the connection request with an appropriate service
definition policy error providing the policy maximum? This would allow
the requesting NSA to find an alternative route that could support an
infinite duration, or at least adjust expectations in a subsequent request.
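The rejection behaviour I am arguing for might look like the following sketch; the policy value, field names, and error shape are all invented for illustration.

```python
# Hypothetical policy check: an "infinite" end-time that exceeds a local
# maximum-duration policy is rejected with an error carrying the policy
# maximum, so the requesting NSA can re-plan. Names are assumptions.

MAX_DURATION_SECONDS = 7 * 24 * 3600  # hypothetical local policy limit

def check_duration(requested_end, start_time: float) -> dict:
    if requested_end == "infinite":
        requested = float("inf")
    else:
        requested = requested_end - start_time
    if requested > MAX_DURATION_SECONDS:
        # Reject instead of silently overwriting, and report the limit.
        return {"status": "rejected",
                "error": "policy: maximum duration exceeded",
                "policy_max_seconds": MAX_DURATION_SECONDS}
    return {"status": "accepted"}

print(check_duration("infinite", 0.0)["status"])  # rejected
print(check_duration(3600.0, 0.0)["status"])      # accepted
```

Returning the policy maximum in the error is what lets the requestor either reroute or adjust its next request.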
“It takes some time to process a request. Possible maximum time required
to process a request and make resources ready for provisioning is called
“guard time”.” I thought that “guard-time” also included the transport
provisioning/activation overhead as well? This definition seems to only
cover the reservation overhead, so a requesting NSA must utilize a
different value for the time to provision.
“This system is designed to be compatible with systems based on 2PC.” I
do not think this statement is 100% true given the example provided.
When we modified DRAC to support the Phosphorus/Harmony interface model,
they also had a two-phase commit (reserve/hold resources and commit
resources), which included an explicit activation operation as well. I
always questioned the value of the commit operation, as it didn’t save us
anything in DRAC since the reserve costs the most. I can now see the
value if we are trying to do a “start-now” operation but want to make
sure an end-to-end path is available before provisioning connections.
_Section 5.1.7 Tree and Chain Connection modes for inter-domain pathfinding_
I found this section interesting in that, even though I understand
general routing, tree, and chain path finding, I had to read the section
three times and make a ton of notes before concluding that I did
understand what was written. I think restructuring this section a bit
would solve my problem.
This section would benefit from a reference network topology diagram and
an example coarse-grained inter-domain path computation somewhere before
tree and chain path finding are introduced. Figures 17 and 18 can then
reference NSA and nodes from the example diagram to show how a path
through the network could be reserved.
The general description of chaining needs to be expanded to provide some
additional details. I would have expected head-end path computation as
the first step to determine a rough path through the network, and to
guide the “next hop” to receive the request in the chain. Although a
similar statement was made for tree-based path finding, it was not stated
for chain-based path finding, though it may have been implied in the
statements around reservation.
I am also concerned with this statement: “Alternatively, if the local
NSA does not have sufficient topology information or authorization
credentials to identify and interact directly with all the downstream
networks, the local NSA can simply choose a neighbor network as the next
hop, and using the interconnect STP as the ingress point, forward a
request to that next hop NSA for handling.” This statement implies to me
that I do not need to do head-end path computation and I can just throw
the request to any adjacent node and it would magically reach its
destination. In a highly connected network this might be a feasible
plan, but in other cases there could be a lot of dead-end computations
before a viable path is found.
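To illustrate why head-end computation matters for chaining, here is a toy sketch: with a coarse path computed over the inter-domain topology, the local NSA forwards to a neighbour actually on a path to the destination rather than an arbitrary adjacent network. The topology and all names are invented for this example.

```python
# Toy head-end (coarse) inter-domain path computation for chain
# forwarding. The topology and names are hypothetical.

from collections import deque

TOPOLOGY = {  # hypothetical adjacency of networks
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],        # forwarding A's request to C dead-ends
    "D": ["B"],
}

def head_end_path(src: str, dst: str):
    """Coarse path computation: BFS over the network-level topology."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nbr in TOPOLOGY[path[-1]]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(path + [nbr])
    return None  # destination unreachable

def next_hop(src: str, dst: str):
    """With a head-end path, the chain request goes to a neighbour on
    a known route to the destination, not an arbitrary one."""
    path = head_end_path(src, dst)
    return path[1] if path and len(path) > 1 else None

print(head_end_path("A", "D"))  # ['A', 'B', 'D']
print(next_hop("A", "D"))       # B, not the dead-end C
```

Without the head-end step, picking C here would force a backtrack, which is exactly the dead-end computation I am worried about in sparsely connected topologies.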
Lastly, the statement “It is highly distributed, scales well and is
robust” could be brought into question given the description of chaining
in this section ;-)