[DFDL-WG] Action 059: External specification of encoding, byte order

Steve Hanson smh at uk.ibm.com
Thu Nov 12 05:07:23 CST 2009


As discussed on the call:

For case 1) the DFDL xsd always wins, and the context is ignored. If the 
user wants to use the encoding/byte order from the context, then he must 
be explicit about this and use case 2) above.

Will adopt suggestion a).  One question - are there any other DFDL 
properties like dfdl:encoding and dfdl:byteOrder that are commonly 
provided by context?  How about dfdl:binaryFloatRepresentation, or 
dfdl:outputNewLine? 

Will not adopt suggestion b). 

Regards

Steve Hanson
Programming Model Architect, WebSphere Message  Brokers,
OGF DFDL WG Co-Chair,
Hursley, UK,
Internet: smh at uk.ibm.com,
Phone (+44)/(0) 1962-815848



From: Steve Hanson/UK/IBM
To: dfdl-wg at ogf.org
Date: 05/11/2009 14:30
Subject: Action 059: External specification of encoding, byte order


DFDL schemas can either:

1) specify fixed encoding(s)/byte order(s) for the data being described, or
2) specify that the encoding/byte order is provided by the 'context' that 
invokes the DFDL processor (using the dfdl:defineVariable 'external' 
facility). **

For case 1), DFDL is faced with a problem: what happens when the 
'context' provides an encoding/byte order for the data, but the DFDL xsd 
specifies a different encoding/byte order? I think DFDL must make a 
statement about this situation, as there are several common scenarios 
where this could occur (HTTP, MIME, MQ). 

It is worth looking at the precedent set by XML in this regard. The 
analogous problem for XML is where the XML document itself specifies 
(using the ?xml declaration) a different encoding from the one provided 
by the context. The recommendations for XML are stated in the appendix 
below - there is no universal rule. 

It is more complicated with DFDL though.  A DFDL xsd can set up the 
encoding(s)/byte order(s) to use in several different places. Which of 
those would the context override? All of them?  Just the one associated 
with the top-level structure? 

My conclusion is therefore that for case 1) the DFDL xsd always wins, and 
the context is ignored. If the user wants to use the encoding/byte order 
from the context, then he must be explicit about this and use case 2) 
above.
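
To make the proposed rule concrete, here is a minimal sketch in Python. It 
is not DFDL and not any processor's API: the names Fixed, External and 
resolve are hypothetical, invented only to show the precedence. A property 
the xsd fixes is returned as-is regardless of context (case 1); a property 
the xsd marks as external takes the context value, falling back to a 
default if one was given (case 2).

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Union


@dataclass(frozen=True)
class Fixed:
    """Case 1: the xsd hard-codes the property value (hypothetical model)."""
    value: str


@dataclass(frozen=True)
class External:
    """Case 2: the xsd defers to a named external variable (hypothetical model)."""
    variable: str
    default: Optional[str] = None


def resolve(prop: Union[Fixed, External], context: Mapping[str, str]) -> str:
    """Return the effective property value under the rule proposed above."""
    if isinstance(prop, Fixed):
        # Case 1: the xsd always wins and the context is ignored.
        return prop.value
    # Case 2: the context supplies the value; otherwise use the default.
    if prop.variable in context:
        return context[prop.variable]
    if prop.default is not None:
        return prop.default
    raise LookupError(f"no value supplied for external variable {prop.variable!r}")


if __name__ == "__main__":
    context = {"encoding": "ISO-8859-1", "byteOrder": "littleEndian"}
    print(resolve(Fixed("UTF-8"), context))                         # UTF-8
    print(resolve(External("encoding", default="UTF-8"), context))  # ISO-8859-1
    print(resolve(External("byteOrder", default="bigEndian"), {}))  # bigEndian
```

The point of the sketch is that the context never silently overrides a 
fixed value; the schema author has to opt in per property.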

There are two things that we could allow to be a bit more flexible:

a) Pre-define $encoding and $byteOrder variables in the DFDL namespace. 
These would implicitly have 'external' = 'true' and perhaps a 
'defaultValue' as well.  This simplifies the coding of a DFDL xsd for case 
2).

b) State that it is an implementation decision to provide an option to use 
a context encoding/byte order for case 1) instead of the ones in the DFDL 
xsd. In such a case, the context MUST override all encodings/byte orders 
in the system of xsds used by the DFDL processor.  (In practice this is 
invariably a single encoding/byte order).

** (Might be more than encoding & byte order - for example MQ also allows 
float format to be provided by context)

Appendix: XML
The equivalent situation for XML is where the XML document specifies its 
own encoding via the ?xml declaration, and the context also provides the 
encoding. There is no single rule; in summary:
        - Basically, if there is a higher-level protocol, then that 
          defines the rules.
        - E.g., for MIME content-type text/xml, the context encoding is 
          used. If this is omitted, the XML is assumed to be US-ASCII. 
          The ?xml declaration encoding is not used.
        - E.g., for MIME content-type application/xml, the context 
          encoding is used. If this is omitted, the ?xml declaration 
          encoding is used.
        - For files (where there is no context encoding), use of the 
          ?xml declaration encoding is recommended.
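
The summary above can be sketched as a small Python function. This is only 
an illustration of the precedence just described, not a conforming XML 
processor: the function name xml_encoding and its parameters are made up 
for the example, it covers only the two media types named above plus the 
file case, and it leaves Byte-Order Mark detection to the caller.

```python
from typing import Optional


def xml_encoding(media_type: Optional[str],
                 charset: Optional[str],
                 declared: Optional[str]) -> Optional[str]:
    """Choose the encoding for an XML document.

    media_type: MIME content-type from the context, or None for a plain file.
    charset:    charset parameter supplied by the context, if any.
    declared:   encoding named in the ?xml declaration, if any.
    Returns None when neither source names an encoding, in which case the
    caller has to fall back to its own detection.
    """
    if media_type is None:
        # File with no context encoding: use the ?xml declaration.
        return declared
    kind = media_type.strip().lower()
    if kind == "text/xml":
        # Context wins; if omitted, US-ASCII. The declaration is not used.
        return charset or "us-ascii"
    if kind == "application/xml":
        # Context wins; if omitted, fall back to the declaration.
        return charset or declared
    raise ValueError(f"media type not covered by this sketch: {media_type!r}")


if __name__ == "__main__":
    print(xml_encoding("text/xml", None, "utf-8"))          # us-ascii
    print(xml_encoding("application/xml", None, "utf-16"))  # utf-16
    print(xml_encoding(None, None, "utf-8"))                # utf-8
```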

Note that in Message Broker, we always use the context encoding, as it 
should always be present. We never use the ?xml declaration.


W3C XML 1.0 spec section F.2 Priorities in the Presence of External 
Encoding Information
The second possible case occurs when the XML entity is accompanied by 
encoding information, as in some file systems and some network protocols. 
When multiple sources of information are available, their relative 
priority and the preferred method of handling conflict should be specified 
as part of the higher-level protocol used to deliver XML. In particular, 
please refer to [IETF RFC 3023] or its successor, which defines the 
text/xml and application/xml MIME types and provides some useful guidance. 
In the interests of interoperability, however, the following rule is 
recommended.
If an XML entity is in a file, the Byte-Order Mark and encoding 
declaration are used (if present) to determine the character encoding.


IETF RFC 3023

3.6 Summary

   The following list applies to text/xml, text/xml-external-parsed-
   entity, and XML-based media types under the top-level type "text"
   that define the charset parameter according to this specification:

   o  Charset parameter is strongly recommended.

   o  If the charset parameter is not specified, the default is "us-
      ascii".  The default of "iso-8859-1" in HTTP is explicitly
      overridden.

   o  No error handling provisions.

   o  An encoding declaration, if present, is irrelevant, but when
      saving a received resource as a file, the correct encoding
      declaration SHOULD be inserted.

   The next list applies to application/xml, application/xml-external-
   parsed-entity, application/xml-dtd, and XML-based media types under
   top-level types other than "text" that define the charset parameter
   according to this specification:

   o  Charset parameter is strongly recommended, and if present, it
      takes precedence.

   o  If the charset parameter is omitted, conforming XML processors
      MUST follow the requirements in section 4.3.3 of [XML].


Regards

Steve Hanson





Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU