[DFDL-WG] Fw: Action 145: 'dispatch' way of discriminating a choice for better performance

Mon Apr 16 14:33:51 EDT 2012

Hi Tim,

I think all our expressions today return a schema logical type, so you are
correct in that the expression should not return DFDL string literal.   But
I don't see the harm in letting dfdl:elementID be a DFDL string literal. We
can disallow some of the DFDL entities, like we do for padding character,
etc, if they don't make sense.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848

From:	Tim Kimber/UK/IBM
To:	Steve Hanson/UK/IBM at IBMGB
Cc:	dfdl-wg at ogf.org
Date:	11/04/2012 10:15
Subject:	Re: [DFDL-WG] Fw: Action 145: 'dispatch' way of discriminating
            a choice for better performance

I think the type of the property should be 'xs:string', not 'DFDL String
Literal'. A DFDL String Literal allows raw bytes, DFDL character entities
and 'generic mnemonics' such as '%WSP*'. I don't see any value in allowing
those in this property - the data in the data stream can easily be
converted to a string as part of the DFDL expression. If we did allow DFDL
String Literal then we would have to specify exactly how to match one DFDL
String Literal against another, but I don't think we should go there.

Proposed new wording:

A new dfdl:element property is added called dfdl:elementID of type 'DFDL
String Literal' 'xs:string'. This provides a single alternative name for
the element. Only allowed on local element and global element, not on
element reference or simple type.

A new dfdl:choice property is added called dfdl:choiceBranchRef of type
'DFDL Expression'. The expression must evaluate to a DFDL String Literal
xs:string. The resultant string must match (case insensitive) the
dfdl:elementID property values of one of the element branches of the
choice, and if so discriminates in favour of that branch. The parser then
goes straight to that branch, ignoring schema order. Because the branch is
'known to exist' no backtracking takes place if a processing error
subsequently occurs.

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742
Internal tel. 246742

From:	Steve Hanson/UK/IBM at IBMGB
To:	dfdl-wg at ogf.org
Date:	11/04/2012 03:57
Subject:	[DFDL-WG] Fw: Action 145: 'dispatch' way of discriminating a
            choice for better performance
Sent by:	dfdl-wg-bounces at ogf.org

Updated to reflect the outcome of DFDL WG call on 5th April.

Revised details:

A new dfdl:element property is added called dfdl:elementID of type 'DFDL
String Literal'. This provides a single alternative name for the element.
Only allowed on local element and global element, not on element reference
or simple type.

A new dfdl:choice property is added called dfdl:choiceBranchRef of type
'DFDL Expression'. The expression must evaluate to a DFDL String Literal.
The resultant string must match (case insensitive) the dfdl:elementID
property values of one of the element branches of the choice, and if so
discriminates in favour of that branch. The parser then goes straight to
that branch, ignoring schema order. Because the branch is 'known to exist'
no backtracking takes place if a processing error subsequently occurs.

Rules:

- Both properties behave like dfdl:ref and dfdl:hiddenGroupRef in that it
is not possible to set a value in scope by a dfdl:format annotation, and is
only set at its point of use. This is because there is nothing sensible
that could be set in scope **. Empty string is not an allowed value.

- Both properties are only used when parsing.

- When dfdl:choiceBranchRef is present, all choice branches must be local
elements or element references. It is a schema definition error otherwise.

- It is a processing error if the resolved value of dfdl:choiceBranchRef
does not match one of the branches.

- It is a schema definition error if individual dfdl:elementID values are
not unique across all elements that are branches of an xs:choice that
carries dfdl:choiceBranchRef.

- It is not a schema definition error if both dfdl:initiatedContent and
dfdl:choiceBranchRef, are provided on the same choice, nor if a
discriminator exists on that choice branch. ++

** Consequently adding support for the property to existing DFDL
implementations will not suddenly cause errors to appear in existing DFDL
schemas.

++ Reasoning. If a discriminator is evaluated and there is no point of
uncertainty, it simply behaves like an assert, specifically if it fails it
causes a processing error without backtracking. Use of dfdl:choiceBranchRef
is just another way to get into this situation. (The property
dfdl:initiatedContent may be rewritten as a series of discriminators on the
child elements which test that the (initiated) element exists, hence same
applies).

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 28/03/2012 16:32 -----

From:        Steve Hanson/UK/IBM
To:        dfdl-wg at ogf.org
Date:        27/03/2012 14:09
Subject:        Fw: [DFDL-WG] Action 145: 'dispatch' way of discriminating
a choice for better performance

Hi Mike

I agree. When I wrote the SWIFT example expression that I realised this was
becoming a bit too complicated.  IBM MRM has a quick way of dispatching a
choice that uses a similar mechanism to the one you propose.
It has an optional property called 'Message Alias'. The parser first tries
to match the tag against the element name, and if that fails it tries the
alias.  I've modified my proposal accordingly.

Regarding wildcards, the approach you show was what we had agreed upon a
couple of years ago when we dropped wildcards. I would prefer not to add
them back in for 1.0, but I have had some feedback from IBMers modeling
envelope/payload formats that they want a loose-coupling approach as that
fits better with other tools and is scalable. Let's treat this as a
separate issue. I'll go back to the IBMers so that I fully understand their
concerns, and if necessary I will bring back to the WG as a new agenda
item.  So all we need to ensure for now is that whatever mechanism we come
up with is extensible to wildcards, if needed.

Revised details:

A new dfdl:element property is added called dfdl:elementAlias of type 'List
of String'. This provides one or more alternative names for the element.
Only allowed on local element and global element, not on element reference
or simple type.

A new dfdl:choice property is added called dfdl:choiceBranchRef of type
DFDL Expression. The expression must evaluate to String. The string must
match (case insensitive) either the element name or one of the
dfdl:elementAlias property values of one of the element branches of the
choice, and if so discriminates in favour of that branch. The parser then
goes straight to that branch, ignoring schema order. Because the branch is
'known to exist' no backtracking takes place if a processing error occurs.

Rules:

- Both properties behave like dfdl:ref and dfdl:hiddenGroupRef in that it
is not possible to set a value in scope by a dfdl:format annotation, and is
only set at its point of use. This is because there is nothing sensible
that could be set in scope. But it has the benefit that adding support for
the property to existing DFDL implementations will not suddenly cause
errors to appear in existing DFDL schemas.  Empty string is not an allowed
value.

- Both properties are only used when parsing.

- When dfdl:choiceBranchRef is preent, all choice branches must be local
elements or element references. It is a schema definition error otherwise.

- It is a processing error if the resolved xs:string value of
dfdl:choiceBranchRef does not match one of the branches.

- It is a schema definition error if a choice has both dfdl:choiceBranchRef
set and dfdl:initiatedContent="yes".

- It is a schema definition error if individual dfdl:elementAlias values
are not unique a) across all elements that are branches of a given
xs:choice and b) across all global elements in the schema.

So we now have the ability to apply a simple lookup before we start to
process a choice. That makes the processing time for each branch of the
choice independent of its order in the schema.

Questions:

- What happens if we encounter a discriminator once we are processing the
branch and its point of uncertainty is the choice ?

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 27/03/2012 13:41 -----

From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:        Steve Hanson/UK/IBM at IBMGB
Cc:        dfdl-wg at ogf.org
Date:        21/03/2012 15:57
Subject:        Re: [DFDL-WG] Action 145: 'dispatch' way of discriminating
a choice for better performance (updated)
Sent by:        dfdl-wg-bounces at ogf.org

The whole point of this thing is to be faster, not more general, so my
reaction is too much XPath expression complexity here.

Consider this. If the tag is some 2-character code that does NOT want
to be the same as the element names (for example because they're
digits, so they can't be the element names exactly since element names
have to begin with an alpha char. Digits also aren't useful from
readability perspective as names), then you'll need a big lookup table
in the choiceBranchRef expression that translates from the codes to
the QNames. We don't have a case statement in the expression language,
so you've just moved the big linear evaluation chain out of evaluating
choice discriminators one after another and into a big if-then-else
nest in the choiceBranchRef expression. I don't see a performance gain
here.

I suggest dropping the QName stuff, and requiring a dfdl:choiceID
property on the elements that is an NCName.  (Well, we might want
those QName functions anyway in the expression language. But I
wouldn't use them for this rapid choice dispatch feature. You could
certainly use them in discriminators. )

The expression would then have to evaluate to a value that is matched
against this choiceID. I suggest exact match, not respecting
ignoreCase for example.

That eliminates all the QName complexity and is amenable to high-speed
compact lookup table implementation.

I tend to think the element names want to be a little bit more
descriptive than these tag values would want to be so using the
element names as the tags feels undesirable to me.
Particularly because we want the tags to be conveniently computed, for
example by just grabbing a fixed-length string out of a data field.

You end up with something like this:

   <element name="tag" type="string" dfdl:length="{ 2 }" ..../>
   <choice dfdl:choiceBranchRef="{ ../tag }">
       <element name="someName"     dfdl:choiceID="02" .../>
       <element name="anotherName" dfdl:choiceID="73" .../>
       ....
  </choice>

As for the wild-card issue. I think we can finesse this. Consider this
model:

<element name="tag" type="string" dfdl:length="{ 2 }" ..../>
<choice>
   <!-- fast dispatch for known record types -->
   <choice dfdl:choiceBranchRef="{ ../tag }">
       <element name="someName"     dfdl:choiceID="02" .../>
       <element name="anotherName" dfdl:choiceID="73" .../>
       ....
  </choice>

  <!-- wildcard -->
  <element name="extensionRecord">
     <complexType>
       <sequence>
          <!-- keep tag copy in the extension -->
          <element name="extensionType" type="string"
dfdl:inputValueCalc="../../tag" .../>
          ....
       </sequence>
    </complexType>
 </element>
</choice>

The inner choice uses the fast dispatch. The outer choice lets me also
have an alternative that absorbs a more general syntax to provide some
way for a user to model their extensions to the choice set. The
extensionType field captures the tag and stores it inside the
extension record where it won't get disassociated.

The user's "extensionRecord" would not be a special DFDL wildcard
construct, just an element format they create which is general enough
to parse their extensions. Or this extension could be predefined as
part of a standard, to accept any standard-defined acceptable
extension record a user might need so long as there is some set of
rules all extension records must obey.

Given this, do we really need special wildcard constructs?

...mikeb

 From:   Steve Hanson <smh at uk.ibm.com>                                     

 To:     dfdl-wg at ogf.org,                                                  

 Date:   24.03.2012 02:27                                                  

 Subject [DFDL-WG] Action 145: 'dispatch' way of discriminating a choice   
 :       for better performance                                            

The enveope/payload style of data format is quite common, where the
envelope provides control information and the payload contains the business
data. Examples are SWIFT and SAP IDocs. Typically the envelope contains a
tag that identifies the payload, which can be one of many types. For SWIFT
there are 300 possible types. To model this today in DFDL requires an
xs:choice with each type modeled as an xs:element branch of the choice. A
discriminator on each xs:element refers back to the envelope tag element
thus enabling the choice to be resolved.

There are two issues with this approach.

1) Performance. Even if the elements in the branches are ordered for
expected frequency, there will still be cases when tens or hundreds of
discriminators need to be evaluated before the choice is resolved.

2) Tight coupling. When a new type is added, a new element branch needs to
be added to the choice.

Action 145 proposes a mechanism to solve issue #1 and which opens the door
to a possible extension to DFDL to solve issue #2 - namely a faster way to
resolve a choice.

Details:

A new dfdl:choice property is added called dfdl:choiceBranchRef of type
DFDL Expression. The expression must evaluate to a QName which corresponds
to one of the element branches of the choice, and asserts 'known to exist'
for that branch.  Rules:

- The property behaves like dfdl:ref and dfdl:hiddenGroupRef in that it is
not possible to set a value in scope by a dfdl:format annotation, and is
only set at its point of use. This is because there is nothing sensible
that could be set in scope. But it has the benefit that adding support for
the property to existing DFDL implementations will not suddenly cause
errors to appear in existing DFDL schemas.

- Empty string is not an allowed value.

- The property is only used when parsing.

- All branches must be local elements or element references. It is a schema
definition error if any branch is a sequence, a choice or a group
reference.

- It is a processing error if the QName does not resolve to one of the
branches when parsing..

- It is a schema definition error if a choice has the property set and also
has dfdl:initiatedContent="yes" set locally.

- Because the expression must return a QName, the expression language must
provide a constructor for creating a QName from a namespace string and a
name string. If you take SWIFT MT103 payload as an example, the tag in the
envelope says "103" but a DFDL schema would actually model the global MT103
element with name "Document" and namespace ="urn:swift:xsd:fin.103.2011".
So the dfdl:choiceBranchRef expression would have to look like:
{fn:QName(fn:concat(fn:concat('urn:swift:xsd:fin.',
FinMessage/Block2/MessageType), ".2011"), 'Document')}

So we now have the ability to derive a QName and apply it before we start
to process a choice. That makes the processing time for each branch of the
choice independent of its order in the schema.

We still have issue #2 so when a new payload is added, a new branch must be
added to the choice. A solution to this is to allows xs:any wildcard
elements back into DFDL, then provide a property dfdl:wildcardRef which
works in the same way as dfdl:choiceRef. So at the point of encountering
the wildcard we know its resolution in the schema.  This obviously will
require some further discussion, but you can see how this ability to
evaluate an expression and return a QName can be used in multiple ways.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

--
 dfdl-wg mailing list
 dfdl-wg at ogf.org
 https://www.ogf.org/mailman/listinfo/dfdl-wg

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

--
  dfdl-wg mailing list
  dfdl-wg at ogf.org
  https://www.ogf.org/mailman/listinfo/dfdl-wg