[DFDL-WG] Fw: Action 233 (deferred) - "byte order not sufficient..." - draft document on experience with binary format MIL-STD-2045

Mon Jul 14 07:07:32 EDT 2014

Mike, some further responses in-line.

Regards

Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848

From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     Steve Hanson/UK/IBM at IBMGB, 
Cc:     "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>
Date:   11/07/2014 18:24
Subject:        Re: [DFDL-WG] Fw: Action 233 (deferred) - "byte order not 
sufficient..." - draft document on experience with binary format 
MIL-STD-2045

Thanks for this additional input. 

Some further thoughts from IBM on your recommendations, after more 
internal discussion here. 

Preferable to have dfdl:bitOrder as a separate property rather than to 
handle it via new dfdl:byteOrder enums. Although new properties pose validation 
issues for existing schemas, this should not compromise the language 
design. DFDL can choose what bitOrder/byteOrder combinations are 
supported. 

OK with the new dfdl:byteOrder enum for littleEndianAtomic16Bit, though 
can we improve the name?
I am absolutely open to suggestions on the name. I adapted this name from 
the wikipedia article terminology. 
SMH: I would just drop the atomic so littleEndian16Bit

dfdl:encoding has an architected system for extra encodings so 
US-ASCII-7-Bit-Packed should be x-US-ASCII-7-Bit-Packed, and the spec 
updated to remove specific mention of US-ASCII-7-Bit-Packed.
Thoughts: if there is no support for this 7-bit packed ascii flavor, then 
there is no point in having dfdl:bitOrder support. The two go together.
SMH: bitOrder has nothing to do with encoding. I could create a format 
with no strings in it and my integers etc could have LSBF bitOrder.  So 
while in  MIL-STD-2045 they might always appear together, that is not 
generally true.
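
To make the point concrete that bit order applies to integers independently of any string encoding, here is a minimal Python sketch that reads two unsigned integer fields (3 bits, then 5 bits) from a single byte under each bit order. The helper name `read_fields` and the field widths are hypothetical, and the convention used (leastSignificantBitFirst consumes bits starting from the least significant bit of each byte, and the first bit consumed is the least significant bit of the field) is an assumption about the intended behaviour, not wording from the spec.

```python
def read_fields(data: bytes, widths, bit_order="mostSignificantBitFirst"):
    """Split data into unsigned integer fields of the given bit widths."""
    msbf = bit_order == "mostSignificantBitFirst"
    # Flatten the bytes into a list of bits in the order they are consumed.
    bits = []
    for byte in data:
        positions = range(7, -1, -1) if msbf else range(8)
        bits.extend((byte >> p) & 1 for p in positions)
    fields, pos = [], 0
    for width in widths:
        chunk = bits[pos:pos + width]
        pos += width
        if msbf:
            # First bit consumed is the most significant bit of the field.
            value = 0
            for b in chunk:
                value = (value << 1) | b
        else:
            # First bit consumed is the least significant bit of the field.
            value = sum(b << i for i, b in enumerate(chunk))
        fields.append(value)
    return fields


# The byte 0xA3 is 1010 0011 in binary.
print(read_fields(bytes([0xA3]), [3, 5]))
print(read_fields(bytes([0xA3]), [3, 5], "leastSignificantBitFirst"))
```

No string or encoding is involved anywhere in this sketch, yet the two bit orders yield different field values from the same byte.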

So in the section on optional DFDL features would we say this is the 
optional feature:
dfdl:bitOrder="leastSignificantBitFirst" and 
dfdl:encoding="x-dfdl-us-ascii-7-bit-packed"
Or is there no mention of the encoding?
SMH: They are separate things so there should be no mention of the 
encoding.

I raise this because the two really go together. There is no point in 
having one without the other, and there needs to be an agreed-upon 
standard meaning for x-dfdl-us-ascii-7-bit-packed encoding.  So this 
x-dfdl-us-ascii-7-bit-packed is a DFDL standard, not an 
implementation-defined standard.  
SMH: I agree that there needs to be a standard definition for 
x-dfdl-us-ascii-7-bit-packed. Its definition is certainly not 
implementation-defined, though whether it is supported is. The question is 
whether it is defined as part of DFDL 1.0 spec, or whether it is defined 
externally. Given that we devolve encoding definitions externally to IANA 
and CCSID, it would be more consistent to point at an external definition.
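
The thread does not spell out the bit layout that a standard definition of x-dfdl-us-ascii-7-bit-packed would need to pin down. As an illustration only, the Python sketch below assumes that 7-bit code units are laid end to end with least-significant-bit-first ordering, so the first character occupies the low 7 bits of the first byte. The function names `pack_7bit` and `unpack_7bit` are hypothetical, and the layout is an assumption rather than the agreed definition.

```python
def pack_7bit(text: str) -> bytes:
    """Pack 7-bit ASCII code units end to end, least significant bit first."""
    acc = 0      # bit accumulator; bit 0 is the next bit to be written
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code > 0x7F:
            raise ValueError(f"not 7-bit ASCII: {ch!r}")
        acc |= code << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # final partial byte, zero padded
    return bytes(out)


def unpack_7bit(data: bytes, count: int) -> str:
    """Inverse of pack_7bit for a known number of characters."""
    acc = int.from_bytes(data, "little")
    return "".join(chr((acc >> (7 * i)) & 0x7F) for i in range(count))
```

Under this assumed layout eight characters occupy exactly seven bytes, which is the kind of detail an external definition would have to state explicitly.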

We discussed proposed new dfdl:lengthKind 'fixedLengthOrTerminated'.  A 
new enum implies that it can be used in any scenario, so the following 
need to be specified. 

- dfdl:terminator must be set and cannot be the empty string or contain 
  %ES; on its own.
- If xs:string or xs:hexBinary, can the maxLength facet be used instead of 
  dfdl:length? (Suggest no - this is variable length data so min/maxLength 
  are for validation only.)
- Can dfdl:length be an expression? (Suggest no unless a specific use case 
  is identified.)
  My use case needs only constants as the maximum, hence enum name contains 
  "fixed" prefix, not "explicit". 
- Any special rules for emptyValueDelimiterPolicy and 
  nilValueDelimiterPolicy?
  Since a terminator must be set, then these cannot be "none" or 
  "initiator".  
  SMH: Doesn't follow. Today, if I specify a terminator, it must be present, 
  modulo EVDP/NVDP. So why is the same not true for the new enum? If we add 
  a new enum, it has to work in a way that is consistent with other 
  lengthKinds and not just for MIL-STD-2045 use cases.
- Use on complex element. Presumably dfdl:length is first used to extract a 
  'box', but within that box does the parser immediately scan for the 
  dfdl:terminator, or does it descend into the complex type and parse the 
  children, expecting either to consume all of the box or to find the 
  terminator at the end? (Suggest the latter.)
  I have no use case that requires this for complex types at all. 
  Perhaps we can dodge this by having it be simpleFixedLengthOrTerminated, 
  and restricting it to simple types only?
  SMH: Perhaps, but that makes this lengthKind enum different from all the 
  others, and that doesn't seem right. 
- Use on complex element. Last child cannot be dfdl:lengthKind 
  'endOfParent'. 
- Scanning rules: Use of this new dfdl:lengthKind switches off any in-scope 
  stack of terminating markup in force at that point. Put another way, when 
  we are scanning for the dfdl:terminator, we are not looking for any markup 
  from an outer scope. 

So there's plenty to think about with this new dfdl:lengthKind. A good 
rule for deciding whether a new dfdl:lengthKind or dfdl:occursCountKind 
enum should be added is whether it bends some other part of the spec out of 
shape. The new dfdl:lengthKind looks ok so far.   

However we *think* we have come up with an alternative model which is 
simpler than the one you state in the document. Example for field 'varstr' 
with max length 100: 

<xs:sequence dfdl:terminator="{ if (fn:string-length(varstr) eq 100) then '%ES;' else '%DEL;' }" ...> 
        <xs:element name="varstr" type="xs:string" 
            dfdl:lengthKind="pattern" 
            dfdl:lengthPattern="([^\x7F].\x7F)|(.{100})" ... /> 
</xs:sequence> 

Can't put dfdl:terminator with a self-referencing expression on the 
element. Might need fn:exists in the dfdl:terminator expression to handle 
optionality. Does that work? 

I don't think this will work as %ES isn't allowed in terminators.
There is a proposal to allow it, but only when length kind is such that 
one is not scanning for delimiters (same restriction as for WSP*). Let's 
assume that we allow %ES for now.
SMH: This has been incorporated as an update to erratum 2.148 and is in 
the latest spec draft.

One beauty of your idea here is that unparsing will "just work", so that's 
nice.

But I think your pattern has a bug: I think it should be 
dfdl:lengthPattern="[^\x7F]{0,99}(?=\x7F)|.{100}"
This will not capture more than 99 characters prior to the DEL, and will 
not include the DEL as part of the string in the case where a DEL is found 
(uses lookahead in regex). Hence, the DEL will be available to be picked 
off as the terminator. Without this you end up with the DEL in the 
payload. 
With that I think your approach would work. So thanks for that idea. 
SMH: Yes my pattern was wrong, thanks for correcting.
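
The difference between the two patterns can be checked outside DFDL. The Python sketch below uses the `re` module as a stand-in for the DFDL regular expression engine. That is an assumption, since DFDL's regex dialect is not Python's, although the constructs used here (negated character class, bounded repetition, lookahead) behave the same way in both. The helper `scan` is hypothetical, and the corrected pattern is written with no whitespace around the `|`.

```python
import re

DEL = "\x7f"

# Pattern as first proposed, and the corrected pattern using a lookahead.
ORIGINAL = re.compile(r"([^\x7F].\x7F)|(.{100})")
CORRECTED = re.compile(r"[^\x7F]{0,99}(?=\x7F)|.{100}")


def scan(pattern, data):
    """Return the text the pattern captures at the start of data, or None."""
    m = pattern.match(data)
    return m.group(0) if m else None


short = "abc" + DEL + "rest"
print(repr(scan(ORIGINAL, short)))   # first pattern only accepts 2 chars + DEL
print(repr(scan(CORRECTED, short)))  # DEL is left in the data for the terminator

full = "x" * 100 + "rest"
print(len(scan(CORRECTED, full)))    # falls through to the fixed-length branch
```

With the corrected pattern the captured string never includes the DEL, so the sequence terminator expression can consume it afterwards. A field of exactly 100 characters falls through to the second alternative, in which case the expression selects the empty-string terminator.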

Perhaps there is an even simpler way to model this, which will work today 
and puts the conditional logic in a choice.

<choice>
       <!-- length kind pattern is needed to bound length to max of 99 -->
       <element name="raw1" type="xs:string" 
           dfdl:lengthKind='pattern' 
           dfdl:lengthPattern="[^\x7F]{0,99}" 
           dfdl:terminator="%DEL;"/>
       <element name="raw2" type="xs:string" 
            dfdl:lengthKind="explicit" 
            dfdl:length="100"/>
</choice>
<element name='value' type='xs:string' 
     dfdl:inputValueCalc='{ if (fn:exists(../raw1)) then ../raw1 else ../raw2 }'/>

We still have to play the hidden group game though to hide raw1 and raw2. 
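
The parse-side behaviour of this choice can be approximated in a few lines of Python. This is a sketch of the intended semantics only: `parse_value` is a hypothetical helper, Python's `re` stands in for the DFDL regex engine, and the branch ordering mirrors a parser that tries choice branches in schema order and moves on when one fails.

```python
import re

DEL = "\x7f"
RAW1 = re.compile(r"[^\x7F]{0,99}")


def parse_value(data: str):
    """Return (value, remaining data) for the raw1/raw2 choice."""
    # Branch raw1: up to 99 non-DEL characters, then a mandatory DEL terminator.
    end = RAW1.match(data).end()   # always matches, possibly with zero length
    if data.startswith(DEL, end):
        return data[:end], data[end + 1:]
    # Branch raw2: exactly 100 characters and no terminator.
    if len(data) < 100:
        raise ValueError("neither choice branch matches")
    return data[:100], data[100:]


print(parse_value("abc" + DEL + "rest"))
print(len(parse_value("x" * 100 + "rest")[0]))
```

The returned value plays the role of the `value` element computed by dfdl:inputValueCalc: it comes from whichever branch succeeded.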

I have to think hard about how to handle a choice like this on unparsing 
though. I'm uncertain about how a dfdl:outputValueCalc on raw1 would 
conditionally fail, so that raw2 would be the selected output 
representation. We can't use an assertion as those aren't evaluated for 
unparsing.

SMH: There is no way to make a choice branch fail when unparsing. (The 
only 'backtracking' when unparsing a choice is when the infoset contains 
no branch at all then the spec states that each branch is examined in turn 
until one is found that successfully applies defaults. But that's not 
really backtracking, as you can statically deduce the branch from the 
schema alone, so the 'default' branch to use can be computed up front).

Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 
----- Forwarded by Steve Hanson/UK/IBM on 11/07/2014 13:09 ----- 

From:        Steve Hanson/UK/IBM 
To:        Mike Beckerle <mbeckerle.dfdl at gmail.com>, 
Cc:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org> 
Date:        08/07/2014 13:31 
Subject:        Re: [DFDL-WG] Action 233 (deferred) - "byte order not 
sufficient..." - draft document on experience with binary format 
MIL-STD-2045 

Mike 

Please find attached IBM's initial comments on your experience document, 
as Word comments.  We only got as far as the three required extensions; we 
have not yet looked at the optional usability items in detail. 

We think we have our collective heads around the least significant bit 
ordering concept, but we think the explanation could be clearer and show 
the bits on-the-wire. Some debate as to whether this could be considered 
some variation of byteOrder but you've obviously thought this through and 
concluded a separate property is best. Also, should bit order apply to text 
representations, given that byteOrder applies to binary representations 
only and any byte ordering variations in encodings are handled as separate 
encodings (e.g., UTF-16LE and UTF-16BE)? 

Regarding the US-ASCII-7-Bit-Packed encoding enum, this was added via 
erratum previously using the idea of DFDL-specific named encoding. But we 
are thinking that this could have been handled as an x- encoding, rather 
than specifically adding it to the spec.  Thinking further along the same 
lines, if byteOrder were made to work like encoding and allow x- enums, 
then the new byteOrder value would become an x- enum.  The Wikipedia 
article you cite on Endianness mentions other byte orders (eg, 
Middle-Endian, PDP-Endian). 
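
For reference, the byte orders mentioned above can be compared using Python's `struct` module. In the sketch below `pdp_endian32` is a hypothetical helper. The thread does not say whether the proposed littleEndianAtomic16Bit enum has this layout, so the PDP ordering is shown purely as an example of a byte order that is neither big-endian nor little-endian.

```python
import struct


def pdp_endian32(value: int) -> bytes:
    """PDP-endian: most significant 16-bit word first, each word little-endian."""
    high, low = (value >> 16) & 0xFFFF, value & 0xFFFF
    return struct.pack("<HH", high, low)


value = 0x0A0B0C0D
print(struct.pack(">I", value).hex())  # big-endian
print(struct.pack("<I", value).hex())  # little-endian
print(pdp_endian32(value).hex())       # PDP-endian (a middle-endian order)
```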

Regards

Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 

From:        Mike Beckerle <mbeckerle.dfdl at gmail.com> 
To:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, 
Date:        24/06/2014 20:27 
Subject:        [DFDL-WG] Action 233 (deferred) - "byte order not 
sufficient..." - draft document on experience with binary format 
MIL-STD-2045 
Sent by:        dfdl-wg-bounces at ogf.org 

I have created an experience document about the "bit order" issue, which 
was a deferred action 233, and the subject of a public comment.

The document is here: http://redmine.ogf.org/dmsf_files/13268. The public 
comment item is http://redmine.ogf.org/boards/15/topics/43.

It recommends a new dfdl:bitOrder property, and a new dfdl:byteOrder enum 
value, without which it is impossible to model these data formats. It also 
recommends  several other improvements to DFDL to facilitate handling 
these data formats. 

The formats in question are a variety of MIL-STD formats which are all 
densely packed binary data. These formats are in broad use. MIL-STD-2045 
is one part of this family and this particular format specification is 
generally available without any restrictions from a US DoD web site 
(http://assistdocs.com) so I made this specific format the subject of the 
document as it illustrates all the problematic issues.

We have implemented the dfdl:bitOrder property in Daffodil, and it works 
with some useful tests now passing. 

We have also enhanced our TDML implementation to enable creation of tests 
for this feature (and in the process actually found two bugs in the 
MIL-STD-2045 spec!). 

Both the property and this TDML enhancement are described in the document.

The sponsors of the Daffodil project are extremely keen to get this needed 
binary support into the DFDL v1.0 standard so as to have multiple DFDL 
implementations support it. 

...mikeb 

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com 
Please note: Contributions to the DFDL Workgroup's email discussions are 
subject to the OGF Intellectual Property Policy 
--
 dfdl-wg mailing list
 dfdl-wg at ogf.org
 https://www.ogf.org/mailman/listinfo/dfdl-wg 

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
