[DFDL-WG] Action 292 - version 2 proposal for hexBinary with lengthUnits bits

Mike Beckerle mbeckerle.dfdl at gmail.com
Fri Dec 7 17:00:13 EST 2018


I went over this issue again mentally.

Here's what I came up with. Note I am using fixed-width font because of
some ascii-art in this email.

So one thing we realized the other day is that the proposal needs at least
this much amendment.

Changing what xs:hexBinary means when dfdl:lengthUnits='bits' would be
binary incompatible. Right now there are schemas with xs:hexBinary in them
where dfdl:lengthUnits='bits' is in scope, but is being ignored because
DFDL v1.0 says it doesn't apply to hexBinary.

So at minimum we need a property to switch on bits-centric behavior for
xs:hexBinary.

Next, we know that XSD constrains things. The length facets are applicable
to hexBinary and are always measured in bytes.
Hence, lexically, there should only EVER be an even number of hex digits in
a hexBinary, and if the facets are present, then the length units cannot be
bits or the values of the facets would be misleading.

So even if the number of bits is 17, you should get 6 hex digits, not 5.
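
In other words (a trivial Python sketch; the function name is mine):

  def hex_digit_count(nbits):
      # round the bit length up to whole bytes, 2 hex digits per byte
      return 2 * ((nbits + 7) // 8)

  assert hex_digit_count(17) == 6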

(I think XML validators may fail on an odd number of hex digits. Not
necessarily all of them, but some may.)

Third, there's no debate that bitOrder matters. The question is only about
whether byteOrder should matter.

Given that, I think there are two possible interpretations of hexBinary.
I'll call them the "byte string" way and the "binary number" way.

THE BYTE STRING WAY

The following would be invariants

* byte order doesn't matter ever
* if the hexBinary's representation is aligned to an 8-bit boundary and is
a multiple of 8 bits long, then the logical value is the same regardless of
bitOrder.

Consider this data stream as hex bytes DE AD BE EF.

Regardless of bit order, all 32 bits taken together, starting on a byte
boundary, the only hexBinary rep would be <foo>DEADBEEF</foo>

Now consider we start at bit 5 (1-based numbering) and proceed for only 24
bits. So we're not going to consume the first 4 bits or the last 4 bits,
where "first" and "last" are relative to the bitOrder.

When bitOrder is MSBF, we would want the data to be <foo>EADBEE</foo>

When bitOrder is LSBF, we would want the data to be <foo>DDEAFB</foo>
(Write the whole bytes backwards, drop first and last nibble, then reverse
again).
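
Here's that LSBF recipe as a quick Python sketch (not DFDL semantics, just
an illustration; the variable names are mine):

  data = bytes.fromhex("DEADBEEF")

  backwards = data[::-1].hex().upper()    # "EFBEADDE" - whole bytes backwards
  middle = backwards[1:-1]                # "FBEADD"   - drop first and last nibble
  result = bytes.fromhex(middle)[::-1].hex().upper()   # reverse the bytes again

  assert result == "DDEAFB"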

Now consider we start at bit 6 and proceed for 22 bits.

when bitOrder is MSBF we would want the data to be ....
  D    E    A    D    B    E    E    F
  1101 1110 1010 1101 1011 1110 1110 1111
  xxxx x110 1010 1101 1011 1110 111x xxxx
        D    5    B    7    D    C
<foo>D5B7DC</foo>
Note that to get the final C, we had to extend the final byte with 2 zero
bits; this is done by shift left / pad on right (least significant side).

when bit order is LSBF we would want the data to be....
  D    E    A    D    B    E    E    F
  1101 1110 1010 1101 1011 1110 1110 1111
reverse the bytes (not the nibbles, the bytes)
  E    F    B    E    A    D    D    E
  1110 1111 1011 1110 1010 1101 1101 1110
  xxxx x111 1011 1110 1010 1101 110x xxxx
         3    D    F    5    6    E
Now reverse the bytes again
<foo>6EF53D</foo>
Note to get the 3 in the final byte we had to assume 2 zero bits on the
left (most significant side).

In the above, we're effectively treating hexBinary as a sequence of 8-bit
integers, followed by a less-than-8-bit integer if the length is not a
multiple of 8 bits, and this less-than-8-bit integer gets adjusted to be a
full byte in a bitOrder-aware way. We don't need byte order because we're
never considering a number that occupies more than 8 bits at a time.
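
To make that concrete, here is a small Python sketch of the byte-string
interpretation (illustrative only, not proposed spec wording; the function
name is mine). It reproduces all four results worked out above:

  def hexbinary_byte_string(data, start_bit, nbits, lsbf):
      # Lay the data stream out as individual bits in consumption order.
      bits = []
      for byte in data:
          byte_bits = [(byte >> i) & 1 for i in range(7, -1, -1)]  # MSB first
          if lsbf:
              byte_bits.reverse()  # LSBF consumes each byte's low-order bit first
          bits.extend(byte_bits)

      field = bits[start_bit - 1 : start_bit - 1 + nbits]  # start_bit is 1-based

      out = bytearray()
      for i in range(0, len(field), 8):
          chunk = field[i : i + 8]
          # Pad a partial final chunk in consumption order: that puts the
          # zero bits on the least significant side for MSBF and on the most
          # significant side for LSBF, as in the examples above.
          chunk = chunk + [0] * (8 - len(chunk))
          if lsbf:
              chunk.reverse()  # back to conventional MSB-first within the byte
          out.append(int("".join(map(str, chunk)), 2))
      return out.hex().upper()

  data = bytes.fromhex("DEADBEEF")
  assert hexbinary_byte_string(data, 5, 24, lsbf=False) == "EADBEE"
  assert hexbinary_byte_string(data, 5, 24, lsbf=True)  == "DDEAFB"
  assert hexbinary_byte_string(data, 6, 22, lsbf=False) == "D5B7DC"
  assert hexbinary_byte_string(data, 6, 22, lsbf=True)  == "6EF53D"

Note that byteOrder is never consulted, which is the point of this
interpretation.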

THE BINARY NUMBER WAY

The second way to do hexBinary would be to effectively treat it as a minor
variation on xs:nonNegativeInteger with binaryNumberRep='binary'.

In this case, if the bytes are DEADBEEF, and the byte order is bigEndian,
the string is <foo>DEADBEEF</foo>, but if byteOrder is littleEndian the
string is <foo>EFBEADDE</foo>

In this case byteOrder matters. (bitOrder didn't matter here because we
were dealing with whole bytes.)
We are always going to represent 2 hex digits for each byte of length
(rounding up for the final byte). So for 3 bytes it is as if the
textNumberPattern were "000000", which means there will sometimes be
leading zeros. (Also, we use hex digits rather than decimal, which goes
without saying.)
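
For the whole-byte, byte-aligned case, a minimal Python sketch of this
second interpretation (again just an illustration; the function name is
mine) would be:

  def hexbinary_binary_number(data, little_endian):
      # Read the bytes as one unsigned integer, honoring byteOrder...
      value = int.from_bytes(data, "little" if little_endian else "big")
      # ...then render 2 hex digits per byte of length, keeping leading
      # zeros, as if the textNumberPattern were "000000" but in base 16.
      return f"{value:0{2 * len(data)}X}"

  data = bytes.fromhex("DEADBEEF")
  assert hexbinary_binary_number(data, little_endian=False) == "DEADBEEF"
  assert hexbinary_binary_number(data, little_endian=True)  == "EFBEADDE"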

If we consider the first example above (DEADBEEF with the first and last
nibbles removed), then

when bitOrder MSBF and byteOrder bigEndian - no change from above
when bitOrder MSBF and byteOrder littleEndian - <foo>EEDBEA</foo> (reversed
from above)
when bitOrder LSBF and byteOrder littleEndian - <foo>FBEADD</foo> (reversed
from above)
when bitOrder LSBF and byteOrder bigEndian (Not allowed in DFDL now) - no
change from above.

Revisiting the 22-bit long examples from above, but adding byteOrder to
them,

when bitOrder MSBF and byteOrder bigEndian - no change from above
when bitOrder MSBF and byteOrder littleEndian - <foo>DCB7D5</foo> (reversed
from above)
when bitOrder LSBF and byteOrder littleEndian - <foo>3DF56E</foo> (reversed
from above)
when bitOrder LSBF and byteOrder bigEndian (Not allowed in DFDL now) - no
change from above.
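
In other words, under this interpretation the littleEndian value is just
the bigEndian value with its bytes reversed. A trivial Python check
(illustrative only):

  def reverse_bytes(hex_str):
      # byteOrder littleEndian flips the byte order of the bigEndian result
      return "".join(hex_str[i:i + 2] for i in range(len(hex_str) - 2, -1, -2))

  assert reverse_bytes("EADBEE") == "EEDBEA"   # 24-bit MSBF example
  assert reverse_bytes("D5B7DC") == "DCB7D5"   # 22-bit MSBF example
  assert reverse_bytes("6EF53D") == "3DF56E"   # 22-bit LSBF example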

My evaluation of this is that the numeric treatment here is actually a bit
problematic because a hexBinary is not a number represented in base 16 -
conceptually it is a byte array.

If I look at the first (leftmost) pair of hex digits in the XML infoset, I
expect to be able to look at the data stream and find that bit pattern.
True, I must know the bitOrder. But if I throw byte order into the mix, I
potentially have to go to the end of the hexBinary (and these can be quite
big - screenfuls or megabytes of data away) to find the hex digits that
correspond to the current location in the data stream.

This is no different from a base 10 number, but I'm never going to be
doing that kind of cross-referencing for a giant base 10 number.

Conclusion.

I see no advantage to the BINARY NUMBER way over the BYTE STRING way. It
changes what you get based on byte order, which seems unnecessary. I think
the added flexibility is not required.



Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>



On Tue, Dec 4, 2018 at 4:10 AM Steve Hanson <smh at uk.ibm.com> wrote:

> I agree that bitOrder is needed, not byteOrder.  If you want to parse the
> data as an integer, then fine but that is not the case here, you are
> parsing the data as hexBinary. The analogy is with your parsing of text
> strings where the encoding is one where the character size is not a
> multiple of 8 bits; you use bitOrder but not byteOrder.
>
> Regards
>
> Steve Hanson
>
> IBM Hybrid Integration, Hursley, UK
> Architect, *IBM DFDL*
> <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
> Co-Chair, *OGF DFDL Working Group* <http://www.ogf.org/dfdl/>
> *smh at uk.ibm.com* <smh at uk.ibm.com>
> tel:+44-1962-815848
> mob:+44-7717-378890
> Note: I work Tuesday to Friday
>
>
>
> From:        Stephen Lawrence <slawrence at tresys.com>
> To:        Steve Hanson <smh at uk.ibm.com>, "mbeckerle.dfdl at gmail.com" <
> mbeckerle.dfdl at gmail.com>
> Cc:        DFDL-WG <dfdl-wg at ogf.org>
> Date:        30/11/2018 18:10
> Subject:        Re: [DFDL-WG] Action 292 - version 2 proposal for
> hexBinary with lengthUnits bits
> ------------------------------
>
>
>
> As an example of why I feel bitOrder and byteOrder apply if supporting
> hexBinary with non-byte size lengths or starting on non-byte boundaries,
> let's say we had the following data:
>
>  11011111 11010001 = 0xDFD1
>
> And we want to model this as one 12-bit unsigned int followed by one
> 4-bit unsigned int, all with bitOrder=LSBF and byteOrder=LE. We would
> have a schema like so:
>
>  <dfdl:format
>    lengthKind="explicit"
>    lengthUnits="bits"
>    bitOrder="leastSignificantBitFirst"
>    byteOrder="littleEndian" />
>
>  <xs:sequence>
>    <xs:element name="foo" dfdl:length="12" type="xs:unsignedInt" />
>    <xs:element name="bar" dfdl:length="4" type="xs:unsignedInt" />
>  </xs:sequence>
>
> The above data would parse as:
>
>  <foo>479</foo> <!-- binary: 000111011111, hex 0x1DF -->
>  <bar>13</bar> <!-- binary: 1101, hex 0xD -->
>
> This is because, due to the bit/byteOrder, "foo" is made up of the last
> four bits in the second byte (0001) followed by the first eight bits of the
> first byte (11011111), resulting in a value of 479. The bitPosition
> after "foo" is consumed is 12. Then "bar" consumes the remaining bits,
> which are the first four of the second byte, resulting in a value of 13.
>
> This all follows the specification as-is.
>
>
> Now, let's assume we instead wanted to represent "foo" as xs:hexBinary
> that has a non-byte size length, e.g.:
>
>  <xs:sequence>
>    <xs:element name="foo" dfdl:length="12" type="xs:hexBinary" />
>    <xs:element name="bar" dfdl:length="4" type="xs:unsignedInt" />
>  </xs:sequence>
>
> If we ignored bitOrder/byteOrder when parsing "foo" and just read the first 12
> bits (essentially BE MSBF), the result would be:
>
>  <foo>0DFD</foo>
>
> But just like before, the bitPosition after "foo" is consumed is 12. And
> because the bit/byteOrder is LSBF LE, the bits that "bar" will consume
> are again the first four of the second byte, with the result
>
>  <bar>13</bar>
>
> But this means that the last four bits in the data (0001) were never
> consumed, and the first four bits in the second byte were consumed
> twice, which must be wrong (a similar issue occurs when starting on a
> non-byte boundary). So bitOrder/byteOrder must be taken into account
> somehow in order to support hexBinary with non-bytesize lengths or
> starting on a non-byte boundary, primarily because of how bitOrder=LSBF
> works (which I believe was the original use-case for non-byte size
> non-byte boundary hexBinary).
>
> If instead we do not ignore bit/byteOrder, there must be some way to
> determine how to get those bits into a hexBinary representation. There
> are probably a few different ways to handle this, but after some
> discussions and interpretations of the XSD spec, we determined that the
> best way to handle it was to just read the bits as if they were a
> nonNegativeInteger (which does take into account bit/byteOrder) and then
> convert those bits to a hex representation. For BE MSBF the result is
> exactly the same. For LE MSBF, it results in the hexBinary being
> flipped, which is where the Daffodil implementation is inconsistent with
> spec.
>
>
>
>
> On 11/29/18 10:19 AM, Steve Hanson wrote:
> > Mike
> >
> > I'm a bit lost on this now.  The concept of applying lengthUnits='bits'
> to
> > xs:hexBinary is straightforward. It just counts bits. Bit order or byte
> order is
> > irrelevant, in the same way that it is irrelevant when counting bytes
> for a hex
> > binary. The only thing to note is that the fillByte needs to be used to
> make up
> > whole bytes.
> >
> > I'm missing something here.
> >
> > Regards
> >
> > Steve Hanson
> >
> > IBM Hybrid Integration, Hursley, UK
> > Architect, IBM DFDL
> > <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
> > Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> > smh at uk.ibm.com
> > tel:+44-1962-815848
> > mob:+44-7717-378890
> > Note: I work Tuesday to Friday
> >
> >
> >
> > From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
> > To: DFDL-WG <dfdl-wg at ogf.org>
> > Date: 20/11/2018 17:33
> > Subject: [DFDL-WG] Action 292 - version 2 proposal for hexBinary with
>
> >   lengthUnits bits
> > Sent by: "dfdl-wg" <dfdl-wg-bounces at ogf.org>
> >
> >
> --------------------------------------------------------------------------------
> >
> >
> >
> > Users want a way to express an arbitrary unaligned string of bits, with
> the
> > appearance in the infoset being hexadecimal, not base 10.
> >
> > Right now the only way I can see to meet this requirement while
> retaining
> > backward compatibility would be a new DFDL property.
> >
> > So here's the new idea:
> >
> > Property dfdl:hexBinaryRep with values 'bytes' or 'bits'. New property,
> so
> > defaulting (with suppressible warning) to 'bytes' for backward
> compatibility in
> > schemas not having the property.
> >
> > When set to 'bits', then type xs:hexBinary would behave just like
> > xs:nonNegativeInteger, and all properties relevant to that type would be
> > applicable, and any use of XSD length facets on such elements would be
> an SDE.
> > The hexBinary string would be exactly same as if you took the numeric
> value for
> > a nonNegativeInteger and instead of presenting it as base 10 digits, you
> use
> > base 16 digits.
> >
> >
> > Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
> > www.tresys.com
> > Please note: Contributions to the DFDL Workgroup's email discussions are
> subject
> > to the _OGF Intellectual Property Policy_
> > <http://www.ogf.org/About/abt_policies.php>
> > --
> >   dfdl-wg mailing list
> >   dfdl-wg at ogf.org
> > https://www.ogf.org/mailman/listinfo/dfdl-wg
> >
> > Unless stated otherwise above:
> > IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> > Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
> 3AU
> >
> >
> > --
> >   dfdl-wg mailing list
> >   dfdl-wg at ogf.org
> >   https://www.ogf.org/mailman/listinfo/dfdl-wg
> >
>
>
>
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>
>

