[DFDL-WG] Clarification on UTF-16 and UTF-32 encoding byte order

Steve Hanson smh at uk.ibm.com
Tue Apr 28 04:13:30 EDT 2020


Section 11.1 was where BOMs were discussed, and it said:

UTF-16.  If a BOM is found then this is used to set the document 
information item [unicodeByteOrderMark] member, and all data with 
dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have 
the implied byte order. If no BOM is found then all data with 
dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have 
big-endian byte order. There is no need to model the BOM explicitly.

UTF-32.  If a BOM is found then this is used to set the document 
information item [unicodeByteOrderMark] member, and all data with 
dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have 
the implied byte order. If no BOM is found then all data with 
dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have 
big-endian byte order. There is no need to model the BOM explicitly.

Same for unparsing. 
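
For illustration only, the parsing rule quoted above can be sketched in 
Python as follows. This is a minimal sketch, not text from the 
specification: the names detect_byte_order and decode_text are 
hypothetical, and it assumes the BOM, if present, sits at the very start 
of the bytes passed in.

```python
import codecs

# BOM byte sequences for each encoding family, paired with the byte order
# they imply. The table is keyed by family because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = {
    "UTF-16": [(codecs.BOM_UTF16_BE, "BE"), (codecs.BOM_UTF16_LE, "LE")],
    "UTF-32": [(codecs.BOM_UTF32_BE, "BE"), (codecs.BOM_UTF32_LE, "LE")],
}


def detect_byte_order(stream_start: bytes, encoding: str):
    """Return (byte_order, bom_length) for the given encoding family.

    Hypothetical helper: if a BOM is found, its implied byte order is
    returned; if none is found, big-endian is assumed, as in the quoted
    Section 11.1 text.
    """
    for bom, order in _BOMS[encoding.upper()]:
        if stream_start.startswith(bom):
            return order, len(bom)
    return "BE", 0  # no BOM found: default to big-endian


def decode_text(data: bytes, encoding: str, byte_order: str) -> str:
    """Decode data using the byte order established for the stream."""
    return data.decode(f"{encoding}-{byte_order}")
```

A caller would run detect_byte_order once at the start of the stream, 
skip bom_length bytes, and reuse the returned byte order for every later 
UTF-16 or UTF-32 field.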

So it looks like we threw the baby out with the bath water when the 
section was removed! 

Regards
 
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday 



From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     DFDL-WG <dfdl-wg at ogf.org>
Date:   23/04/2020 20:23
Subject:        [EXTERNAL] [DFDL-WG] Clarification on UTF-16 and UTF-32 
encoding byte order
Sent by:        "dfdl-wg" <dfdl-wg-bounces at ogf.org>



Since we dropped the Unicode byte order mark functionality from DFDL v1.0, 
the issue arises of what byte order is used when dfdl:encoding="utf-16" or 
dfdl:encoding="utf-32".

We are clear that encodings define their own byte and bit order; the 
dfdl:byteOrder property is not used.

There are these options:
1) Explicitly disallow these encoding names because they do not specify a 
byte order. Require utf-16BE or utf-16LE, utf-32BE or utf-32LE.
2) Specify that these are synonyms for the BE versions.
3) Specify that these are synonyms for the LE versions.
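
To make the three options concrete, here is a minimal Python sketch of 
how an implementation might resolve the encoding name under each one. The 
function name resolve_encoding and the policy values "disallow", "BE" and 
"LE" are hypothetical, chosen only for this illustration; nothing here is 
taken from the specification or from any DFDL processor.

```python
def resolve_encoding(name: str, policy: str) -> str:
    """Resolve a dfdl:encoding value to a name with explicit byte order.

    policy is a hypothetical switch selecting one of the three options:
      "disallow" - option 1: reject names that carry no byte order
      "BE"       - option 2: treat them as synonyms for the BE versions
      "LE"       - option 3: treat them as synonyms for the LE versions
    """
    canonical = name.upper()
    if canonical not in ("UTF-16", "UTF-32"):
        # Names such as UTF-16BE already state their byte order.
        return canonical
    if policy == "disallow":
        raise ValueError(
            f"encoding {name!r} does not specify a byte order; "
            f"use {canonical}BE or {canonical}LE"
        )
    if policy in ("BE", "LE"):
        return canonical + policy
    raise ValueError(f"unknown policy {policy!r}")
```

Under option 2, resolve_encoding("utf-16", "BE") gives "UTF-16BE", while 
under option 1 the same name is rejected as a schema error.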

This comes up in the definition of the dfdl:byteOrder property where the 
text currently says:

This property is never used to establish the byte order for text/strings
with Unicode fixed-width encodings that do not specify the byte order
(UTF-16 and UTF-32).

Having removed the unicode byte order mark feature, this statement leaves 
us without a stipulation of how UTF-16 and UTF-32 byte order would be 
determined.

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Owl Cyber Defense | 
www.owlcyberdefense.com
Please note: Contributions to the DFDL Workgroup's email discussions are 
subject to the OGF Intellectual Property Policy
--
  dfdl-wg mailing list
  dfdl-wg at ogf.org
  
https://www.ogf.org/mailman/listinfo/dfdl-wg


