[DFDL-WG] Clarification on UTF-16 and UTF-32 encoding byte order

Tue Apr 28 11:00:01 EDT 2020

So the important use case is this:

The data is dfdl:representation='text', the encoding is utf-16 but a BOM
tells us the byte order - but there is exactly one BOM, and it is at the
start of the file.
That BOM needs to be read, remembered, and used to guide recreation of this
data in any associated unparse.

That special case is why we had this Unicode byte order mark feature at the
document level.
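
As a rough illustration of that use case (this is not DFDL, and not any
implementation's API), the Python sketch below reads a single leading BOM,
remembers the byte order it implies, and reuses it when writing the data back
out. The names parse_text and unparse_text are made up for this example, and
the big-endian default when no BOM is present is an assumption taken from the
old Section 11.1 wording.

```python
import codecs

def parse_text(data: bytes):
    """Return (text, byte_order, had_bom) for UTF-16 data with at most one leading BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le"), "little", True
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be"), "big", True
    # Assumption for this sketch: no BOM means big-endian.
    return data.decode("utf-16-be"), "big", False

def unparse_text(text: str, byte_order: str, had_bom: bool) -> bytes:
    """Recreate the bytes using the byte order remembered from the parse."""
    little = byte_order == "little"
    bom = b""
    if had_bom:
        bom = codecs.BOM_UTF16_LE if little else codecs.BOM_UTF16_BE
    return bom + text.encode("utf-16-le" if little else "utf-16-be")

# Round trip: little-endian data with a BOM comes back out byte for byte.
original = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
text, order, had_bom = parse_text(original)
assert unparse_text(text, order, had_bom) == original
```

The point of the sketch is only that the byte order has to live somewhere
outside the string itself between parse and unparse.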

What we failed to appreciate is that the byte order of this data will not
vary day to day, but will nearly always be constant. Data coming from
big-endian systems will be big-endian, and from little-endian systems will
be little-endian.

So the use case for a schema that needs to adapt to either is a rarer case.
That's why the features we used to have were overkill: in the vast majority
of cases like the one above, the user's data will always have the same byte
order, because it is coming from one system.

Where we are today: we have already modified the DFDL spec draft to remove
everything about byte order marks, EXCEPT that we didn't remove support for
UTF-16 or UTF-32, where a BOM might come in handy.

I think to fix this it's either plan A or plan B.

Plan (A) - Keep it simple - Just disallow utf-16 and utf-32 without
byte-order specifiers - make people use the more specific encodings that
specify byte order. If they in fact have data which varies in byte order
from instance to instance, they have to model that... just as they would
for binary data with that behavior.  (We can supply this as sample code.)
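
A minimal sketch of what Plan (A) could look like from an implementation's
point of view, written in Python purely for illustration: the unadorned names
are rejected, and data whose byte order really does vary is modeled
explicitly by reading a two-byte marker and then picking the specific
encoding. The names check_encoding and read_marked_string are hypothetical,
and treating the marker as a BOM at the start of the data is an assumption of
this sketch, not something the spec draft says.

```python
import codecs

# DFDL-style encoding names that carry their byte order, mapped to Python codecs.
SPECIFIC = {
    "utf-16be": "utf-16-be",
    "utf-16le": "utf-16-le",
    "utf-32be": "utf-32-be",
    "utf-32le": "utf-32-le",
}
AMBIGUOUS = {"utf-16", "utf-32"}

def check_encoding(name: str) -> str:
    """Return the Python codec for a byte-order-specific name; reject unadorned ones."""
    key = name.lower()
    if key in AMBIGUOUS:
        raise ValueError(f"{name!r} does not specify a byte order; use {name}BE or {name}LE")
    if key not in SPECIFIC:
        raise ValueError(f"{name!r} is not handled by this sketch")
    return SPECIFIC[key]

def read_marked_string(data: bytes) -> str:
    """Model a leading two-byte marker explicitly, then decode with a specific encoding."""
    marker, body = data[:2], data[2:]
    if marker == codecs.BOM_UTF16_BE:
        return body.decode(check_encoding("UTF-16BE"))
    if marker == codecs.BOM_UTF16_LE:
        return body.decode(check_encoding("UTF-16LE"))
    raise ValueError("unrecognized byte order marker")
```

In a real schema the marker would be an ordinary modeled element and the
choice of encoding would be driven by its value, the same way a byte order
flag is handled for binary data.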

Plan (B) - Go back to what we had before. All of it. Even though nobody has
implemented it or wants to.

My preference is plan (A). I think this is entirely sufficient for DFDL
v1.0.

There's one other option, Plan (C), which would be to document that unadorned
UTF-16 means this: on parse, accept the BOM, keep it as a character in the
string, and use it to interpret the rest of the characters. On unparse, a BOM
character is preserved if present, but output is always big-endian: the BOM
(written only if the character is present at the start of the string) is
written as a big-endian BOM. If not present, none is added. The other
characters are always written big-endian. This is the "converts to BE" model.
It's what the Java UTF-16 encoders/decoders do if you do nothing special to
force them to behave any particular way.
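
To make the Plan (C) behavior concrete, here is a small Python sketch of the
model as described above. It is not the Java codec itself: it uses Python's
explicit utf-16-le and utf-16-be codecs, which leave a leading U+FEFF in the
decoded string, to get the "BOM stays in the string" behavior. The function
names parse_plan_c and unparse_plan_c are invented for this example.

```python
import codecs

def parse_plan_c(data: bytes) -> str:
    """Use a leading BOM to pick the byte order, and keep it as U+FEFF in the result."""
    if data.startswith(codecs.BOM_UTF16_LE):
        # Little-endian BOM: decode everything, BOM included, as little-endian.
        return data.decode("utf-16-le")
    # Big-endian BOM or no BOM at all: decode as big-endian.
    return data.decode("utf-16-be")

def unparse_plan_c(text: str) -> bytes:
    """Always write big-endian; a leading U+FEFF comes out as FE FF, and none is added."""
    return text.encode("utf-16-be")

# Little-endian input with a BOM is converted to big-endian output with a BOM.
le_input = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
text = parse_plan_c(le_input)
assert text == "\ufeffhi"
assert unparse_plan_c(text) == codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
```

Note that under this model a little-endian file does not round-trip byte for
byte, which is the trade-off against the document-level BOM feature.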

Thoughts?

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Owl Cyber Defense |
www.owlcyberdefense.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>

On Tue, Apr 28, 2020 at 4:13 AM Steve Hanson <smh at uk.ibm.com> wrote:

> Section 11.1 was where BOMs were discussed, and it said:
>
> UTF-16.  If a BOM is found then this is used to set the document
> information item *[unicodeByteOrderMark]* member, and all data with
> dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have
> the implied byte order. If no BOM is found then all data with
> dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have
> big-endian byte order. There is no need to model the BOM explicitly.
>
> UTF-32.  If a BOM is found then this is used to set the document
> information item *[unicodeByteOrderMark]* member, and all data with
> dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have
> the implied byte order. If no BOM is found then all data with
> dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have
> big-endian byte order. There is no need to model the BOM explicitly.
>
>
> Same for unparsing.
>
> So it looks like we threw the baby out with the bath water when the
> section was removed!
>
> Regards
>
> Steve Hanson
>
> IBM Hybrid Integration, Hursley, UK
> Architect, *IBM DFDL*
> <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
> Co-Chair, *OGF DFDL Working Group* <http://www.ogf.org/dfdl/>
> *smh at uk.ibm.com* <smh at uk.ibm.com>
> tel:+44-1962-815848
> mob:+44-7717-378890
> Note: I work Tuesday to Friday
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        DFDL-WG <dfdl-wg at ogf.org>
> Date:        23/04/2020 20:23
> Subject:        [EXTERNAL] [DFDL-WG] Clarification on UTF-16 and UTF-32
> encoding byte order
> Sent by:        "dfdl-wg" <dfdl-wg-bounces at ogf.org>
> ------------------------------
>
>
>
> Since we dropped the Unicode byte order mark functionality from DFDL v1.0,
> the issue arises of what byte order is used when dfdl:encoding="utf-16" or
> dfdl:encoding="utf-32".
>
> We are clear that encodings define their own byte and bit order; the
> dfdl:byteOrder property is not used.
>
> There are these options:
> 1) explicitly disallow these encoding names because they do not specify a
> byte order. Require utf-16BE or utf-16LE, utf-32BE or utf-32LE.
> 2) specify that these are synonyms for the BE versions
> 3) specify that these are synonyms for the LE versions
>
> This comes up in the definition of the dfdl:byteOrder property where the
> text currently says:
>
> This property is never used to establish the byte order for text/strings
> with Unicode fixed-width encodings that do not specify the byte order
> (UTF-16 and UTF-32).
>
> Having removed the Unicode byte order mark feature, this statement leaves
> us without a stipulation of how UTF-16 and UTF-32 byte order would be
> determined.
>