[DFDL-WG] Review of draft-gwdi-mil-std-2045-additional-features

Mon Jul 28 13:23:14 EDT 2014

IBM has continued its review of the proposed additions to lengthKind and 
occursCountKind to simplify the modelling of MIL-STD-2045 formats.  The 
email below carries on from an earlier email but has removed everything to 
do with bitOrder etc. New stuff is in blue.

Regards

Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 28/07/2014 11:31 -----

1) Proposed new dfdl:lengthKind 'fixedLengthOrTerminated'.  

A new enum implies that it can be used in any scenario, so the following 
need to be specified. 
dfdl:terminator must be set and cannot be the empty string or contain %ES; 
on its own 
If xs:string or xs:hexBinary, can maxLength facet be used instead of 
dfdl:length? (Suggest no - this is variable length data so min/maxLength 
are for validation only). 
Can dfdl:length be an expression? (Suggest no unless specific use case 
identified)
My use case needs only constants as the maximum, hence enum name contains 
"fixed" prefix, not "explicit". 
Any special rules for emptyValueDelimiterPolicy and 
nilValueDelimiterPolicy ?
Since a terminator must be set, then these cannot be "none" or 
"initiator".  
SMH: Doesn't follow. Today, if I specify a terminator, it must be present, 
modulo EVDP/NVDP. So why is the same not true for the new enum? If we add 
a new enum, it has to work in a way that is consistent with other 
lengthKinds and not just for MIL-STD-2045 use cases.
Use on complex element. Presumably dfdl:length is first used to extract a 
'box' but within that box does parser immediately scan for the 
dfdl:terminator or does it descend into the complex type and parse the 
children, expecting to either consume all the box or to find the 
terminator at the end? (Suggest the latter).
I have no use case that requires this for complex types at all. 
Perhaps we can dodge this by having it be simpleFixedLengthOrTerminated, 
and restricting it to simple types only. ?
SMH: Perhaps, but that makes this lengthKind enum different from all the 
others, and that doesn't seem right. 
Use on complex element. Last child cannot be dfdl:lengthKind 
'endOfParent'. 
Scanning rules: Use of this new dfdl:lengthKind switches off any in-scope 
stack of terminating markup in force at that point. Put another way, when 
we are scanning for the dfdl:terminator, we are not looking for any markup 
from an outer scope. 
So there's plenty to think about with this new dfdl:lengthKind. A good 
rule for deciding whether a new dfdl:length or dfdl:occursCountKind should 
be added is whether it bends some other part of the spec out of shape. The 
new dfdl:lengthKind looks ok so far.   

However we *think* we have come up with an alternative model which is 
simpler than the one you state in the document. Example for field 'varstr' 
with max length 100: 

<xs:sequence dfdl:terminator="{if (fn:str-len(varstr) eq 100) then '%ES;' 
else '%DEL'}" ...> 
        <xs:element name="varstr" type="xs:string" 
dfdl:lengthKind="pattern" dfdl:pattern="([^\x7F].\x7F)|(.{100})" ... /> 
</xs:sequence> 

Can't put dfdl:terminator with a self-referencing expression on the 
element. Might need fn:exists in the dfdl:terminator expression to handle 
optionality. Does that work? 

I don't think this will work as %ES isn't allowed in terminators.
There is a proposal to allow it, but only when length kind is such that 
one is not scanning for delimiters (same restriction as for WSP*). Let's 
assume that we allow %ES for now.
SMH: This has been incorporated as an update to erratum 2.148 and is the 
latest spec draft.

One beauty of your idea here is that unparsing will "just work", so that's 
nice.

But I think your pattern has a bug: I think it should be 
dfdl:pattern="[^\x7F]{0,99}(?=\x7F)|.{100}"
This will not capture more than 99 characters prior to the DEL, and will 
not include the DEL as part of the string in the case where a DEL is found 
(uses lookahead in regex). Hence, the DEL will be available to be picked 
off as the terminator. Without this you end up with the DEL in the 
payload. 
With that I think your approach would work. So thanks for that idea. 
SMH: Yes my pattern was wrong, thanks for correcting.
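
To make the lookahead behaviour concrete, here is a minimal sketch in Python rather than DFDL. It assumes Python's re module treats this particular pattern the same way as the regex engine behind dfdl:lengthKind 'pattern', which has not been verified here; the function name match_varstr is made up for illustration.

```python
import re

DEL = "\x7f"

# The corrected pattern from the discussion: up to 99 non-DEL characters
# that are followed by a DEL (checked by lookahead, so the DEL is not
# consumed), or else exactly 100 characters.
PATTERN = re.compile(r"[^\x7F]{0,99}(?=\x7F)|.{100}")


def match_varstr(data):
    """Return (value, rest); rest begins immediately after the matched value."""
    m = PATTERN.match(data)
    if m is None:
        raise ValueError("pattern did not match")
    return m.group(0), data[m.end():]


# Short value: the DEL is left in the data for the terminator to consume.
short_value, short_rest = match_varstr("abc" + DEL + "next")

# Full-length value: exactly 100 characters are taken and no DEL is expected.
full_value, full_rest = match_varstr("x" * 100 + "next")
```

In the short case the DEL stays at the front of the remaining data, which is what lets it be picked off as the terminator instead of ending up in the payload.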

SMH: Also realised that the dfdl:terminator expression is illegal, as it 
looks downwards. The correct DFDL is:

<xs:sequence ...> 
        <xs:element name="varstr" type="xs:string" 
dfdl:lengthKind="pattern" dfdl:pattern="[^\x7F]{0,9}(?=\x7F)|.{10}" ... /> 

        <xs:sequence dfdl:terminator="{if (fn:string-length(./varstr) eq 
10) then '%ES;' else '%DEL;'}" .../> 
</xs:sequence> 

I have tested this (using {if (fn:string-length(./varstr) eq 10) then 
'%WSP*;' else '%DEL;'} as %ES; not yet allowed in terminator) and it works 
ok both parse and unparse.
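
The parse and unparse behaviour of this model can be sketched outside DFDL as follows. This is only a simulation in Python under the assumption that %ES; behaves as an empty terminator; the names terminator_for, parse_varstr and unparse_varstr are hypothetical and it is not the IBM DFDL implementation.

```python
import re

DEL = "\x7f"
MAX_LEN = 10

# Same shape as the schema's pattern, with a maximum length of 10.
PATTERN = re.compile(r"[^\x7F]{0,9}(?=\x7F)|.{10}")


def terminator_for(value):
    # Mirrors the terminator expression: empty string (%ES;) when the value
    # is at full length, otherwise DEL.
    return "" if len(value) == MAX_LEN else DEL


def parse_varstr(data):
    """Return (value, rest) after consuming the value and its terminator."""
    m = PATTERN.match(data)
    if m is None:
        raise ValueError("varstr not found")
    value = m.group(0)
    term = terminator_for(value)
    rest = data[m.end():]
    if not rest.startswith(term):
        raise ValueError("expected terminator is missing")
    return value, rest[len(term):]


def unparse_varstr(value):
    """Write the value followed by whichever terminator its length implies."""
    if len(value) > MAX_LEN or DEL in value:
        raise ValueError("value cannot be represented")
    return value + terminator_for(value)
```

The same terminator_for function drives both directions, which is the sense in which unparsing "just works" with this model.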

It was noted that if the terminator expression was allowed to refer to the 
value of its own element then this could be simplified to:

        <xs:element name="varstr" type="xs:string" 
dfdl:lengthKind="pattern" dfdl:pattern="[^\x7F]{0,9}(?=\x7F)|.{10}"  
                          dfdl:terminator="{if (fn:string-length(.) eq 10) 
then '%ES;' else '%DEL;'}" .../> 

Clearly this relaxation could only occur when lengthKind was not 
delimited. (That is, the same condition that we have proposed allowing 
%ES; for terminator/separator). But I think it also violates the 
known-to-exist rules ? Certainly IBM DFDL says it can't find '.' in the 
infoset when I tried this. So perhaps this is not a good idea.

2) Proposed new dfdl:occursCountKind 'prefixed'.  

The motivation here is to avoid the explosion of global groups needed for 
the hidden presence indicators. It was observed that a single global group 
could be used if the expression used a predicate when referring to the FPI 
element, though obviously that makes the schema very fragile. 

At first glance the new enum would appear to be symmetric with lengthKind 
'prefixed', but on closer examination this is not true:

Presumably the new enum would apply to optional elements and arrays.  It 
would have to fit into the grammar thus:

        Array = [ [PrefixOccursCount Separator] EnclosedElement [ 
Separator EnclosedElement ]*  [ Separator StopValue] ]

        PrefixOccursCount = SimpleNormalRep

It would be wrong to couple the prefix more tightly to the first 
occurrence (by more tightly I mean like prefix length where the length 
occurs after the element's left framing region). When parsing, if the 
value is 0 then nothing else is expected in the data - zero occurrences, 
so no other DFDL properties are even examined. It must therefore occur 
ahead of all occurrences. If it is doing that, then it may as well have 
its own left and right framing, hence use of SimpleNormalRep rather than 
SimpleContent, and work with delimiters. 
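
As a rough illustration of the count sitting ahead of all occurrences, here is a Python sketch. The text representation of the count, the comma separator and the fixed item width of 2 are all assumptions made for the example, and parse_prefixed_array is a made-up name.

```python
ITEM_WIDTH = 2  # hypothetical fixed width of each occurrence
DIGITS = "0123456789"


def parse_prefixed_array(data, separator=","):
    """Parse '<count>' followed by count times '<separator><item>'.

    Returns (items, rest). When the count is 0 nothing further is read,
    so no separator or occurrence is looked for at all.
    """
    i = 0
    while i < len(data) and data[i] in DIGITS:
        i += 1
    if i == 0:
        raise ValueError("occurs count prefix is missing")
    count = int(data[:i])
    rest = data[i:]

    items = []
    for _ in range(count):
        if not rest.startswith(separator):
            raise ValueError("separator is missing before an occurrence")
        rest = rest[len(separator):]
        if len(rest) < ITEM_WIDTH:
            raise ValueError("data ended inside an occurrence")
        items.append(rest[:ITEM_WIDTH])
        rest = rest[ITEM_WIDTH:]
    return items, rest
```

With a count of 0 the loop body never runs, matching the point that no other properties of the array element need to be examined in that case.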

However IBM questions the need for the enum as it can also be modelled 
using a choice of two sequences which, if you put the discriminator on the 
hidden FPI element itself, means you can get away with just two global 
groups.  And you don't need outputValueCalc as you can just use defaults.
  ...
  <!-- Element unit_name -->
  <xs:choice>
  <xs:sequence>
    <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_true" />
    <xs:element name="unit_name" type="..." ... />
  </xs:sequence>
  <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_false" />
</xs:choice> 
   <!-- Element unit_type -->
    <xs:choice>
  <xs:sequence>
    <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_true" />
    <xs:element name="unit_type" type="..." ... />
  </xs:sequence>
  <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_false" />
</xs:choice> 
  ...

  <xs:group name="vmdfdl:gh_mil_std_2045_FPI_true" >
  <xs:sequence>
    <xs:element name="FPI" type="xs:boolean" default="true" ... >
        <dfdl:discriminator test="{. eq fn:true()}" />
    </xs:element>
  </xs:sequence>
</xs:group> 

  <xs:group name="vmdfdl:gh_mil_std_2045_FPI_false" >
  <xs:sequence>
    <xs:element name="FPI" type="xs:boolean" default="false" ... >
    </xs:element>
  </xs:sequence>
</xs:group> 
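
The effect of the two-branch choice can be sketched in Python as below. The one-character '1'/'0' indicator and the field width of 4 are assumptions for the example only (the real FPI is a single bit), and the function names are hypothetical.

```python
FIELD_WIDTH = 4  # hypothetical fixed width of the optional field


def parse_optional_field(data):
    """Return (value_or_None, rest) for an FPI-guarded field.

    A '1' indicator selects the first branch (indicator then field);
    a '0' indicator selects the second branch (indicator only).
    """
    if not data:
        raise ValueError("FPI is missing")
    fpi, rest = data[0], data[1:]
    if fpi == "1":
        if len(rest) < FIELD_WIDTH:
            raise ValueError("data ended inside the field")
        return rest[:FIELD_WIDTH], rest[FIELD_WIDTH:]
    if fpi == "0":
        return None, rest
    raise ValueError("invalid FPI value")


def unparse_optional_field(value):
    # The caller never supplies the indicator. It follows from which branch
    # applies, mirroring the default values on the hidden FPI elements.
    if value is None:
        return "0"
    if len(value) != FIELD_WIDTH:
        raise ValueError("field has the wrong width")
    return "1" + value
```

On unparsing the indicator is derived purely from whether the field is present, which is why defaults suffice and no outputValueCalc is needed.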

3) Proposed new dfdl:occursCountKind 'repeatUntil'.

It seems to IBM that the only practical effect of the new enum 
'repeatUntil' is to simplify the discriminator. It doesn't remove it nor 
does it remove the need for the hidden FRI element. IBM does not see the 
benefit of the new enum in its proposed form. Further...

If the above proposal is used for the FPI, the dfdl:occursIndex() branch 
of the discriminator simplifies to fn:true().
The FRI is local to the array element so, when parsing at least, there is 
no need for a globally unique group for each array. 
That simplifies the discriminator to the following and means you only need 
one global group for FRI.
<dfdl:discriminator>
        if (dfdl:occursIndex() eq 1) then fn:true() else 
../<array>[dfdl:occursIndex()-1]/vmdfdl:gh_mil_std_2045_FRI
</dfdl:discriminator>

For that to work on unparsing there needs to be a generic way to set the 
(Boolean) FRI from within the hidden group.  Something like 

        dfdl:outputValueCalc="{dfdl:occursIndex() eq fn:count(..)}" 

There is a problem with this though. The property is on the FRI element so 
what does dfdl:occursIndex() return? The spec says it returns "the 
position of the current item within an array" but also says "this function 
may be used on non-array elements". I'm not clear what it would return for 
the latter case - does it return 1 or does it look back to its parent or 
... ? Here we want the index of the parent. Perhaps this function needs to 
take an argument to be unambiguous, eg, . or .. or ../.., ie, it can only 
refer back up to the root.  (In fact this problem applies whether or not 
there is a single FRI or one per array).

A counter proposal...

One way to really simplify this type of occurrence indicator is to 
consider it as part of the element, in the same way as a length prefix. 
This tight binding makes sense here, because there is an indicator per 
occurrence. 

        dfdl:occursCountKind="stopIndicator" 
dfdl:occursStopIndicatorType="<type>"

The stop indicator type must be derived from xs:boolean. True means the 
occurrence is the last. False means it is not. (Or we can do it the other 
way round.) The DFDL Boolean properties of the type can always be used to 
compensate. The parser would work a bit like it does for 'stopValue' - it 
keeps parsing speculatively until it finds an occurrence which indicates 
the end of the array - the difference being that in this case it is added 
to the infoset.  The oddity about this is that it applies to arrays only 
and does not work with optional elements, so it cannot be used with 
minOccurs = '0'. 

Grammar becomes:

SimpleNormalRep = LeftFraming StopIndicator PrefixLength SimpleContent 
RightFraming
ComplexNormalRep = LeftFraming StopIndicator PrefixLength ComplexContent 
ElementUnused RightFraming

StopIndicator = SimpleContent
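
A Python sketch of how a parser and unparser might treat such an array follows. It assumes a one-character indicator ('1' meaning last occurrence), fixed-width items of 2 characters, and no framing or length prefix; the function names are invented for the example.

```python
ITEM_WIDTH = 2  # hypothetical fixed width of each occurrence's content


def parse_stop_indicator_array(data):
    """Read '<indicator><item>' occurrences until an indicator of '1'.

    Returns (items, rest). At least one occurrence is always read, which
    is why this scheme cannot represent zero occurrences.
    """
    items = []
    pos = 0
    while True:
        if pos + 1 + ITEM_WIDTH > len(data):
            raise ValueError("data ended before the last occurrence")
        indicator = data[pos]
        if indicator not in "01":
            raise ValueError("invalid stop indicator")
        items.append(data[pos + 1:pos + 1 + ITEM_WIDTH])
        pos += 1 + ITEM_WIDTH
        if indicator == "1":
            return items, data[pos:]


def unparse_stop_indicator_array(items):
    """Write each item preceded by its indicator; only the last gets '1'."""
    if not items:
        raise ValueError("at least one occurrence is required")
    out = []
    for index, item in enumerate(items, start=1):
        out.append(("1" if index == len(items) else "0") + item)
    return "".join(out)
```

On unparsing the indicator is computed from the position of the occurrence relative to the total count, so the user never has to supply it.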
