[DFDL-WG] Required encodings and testing

Wed Jun 25 10:48:43 CDT 2008

Very interesting paragraph here from http://www.ietf.org/rfc/rfc2781.txt,
emphasis mine.

   It is important to understand that the character 0xFEFF appearing at
   any position other than the beginning of a stream MUST be interpreted
   with the semantics for the zero-width non-breaking space, and MUST
   NOT be interpreted as a byte-order mark. The contrapositive of that
   statement is not always true: the character 0xFEFF in the first
   position of a stream MAY be interpreted as a zero-width non-breaking
   space, and is not always a byte-order mark. For example, if a process
   splits a UTF-16 string into many parts, a part might begin with
   0xFEFF because there was a zero-width non-breaking space at the
   beginning of that substring.

In DFDL, we have no way of knowing whether a string is supposed to be the
real beginning of "a stream", or is some chunk of the middle of something.
For that reason it is consistent for DFDL to ALWAYS interpret 0xFEFF as a
ZWNBS, and never as a BOM. 

So if you want BOM behavior, it's because the beginning of a stream has
special treatment. In that case it is reasonable to model the BOM as a
separate element to be found at the beginning of a "stream", optionally
hidden, perhaps optional, and to compute dfdl:byteOrder in terms of its
value.

I think this position is pretty well supported by the above paragraph from
rfc2781.
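
As a rough illustration of that modelling (a minimal sketch in Python, not
DFDL; the function name parse_stream and the default byte order are
assumptions made for the example), the BOM is read as an optional leading
element and the byte order is computed from it, while U+FEFF inside the
string stays an ordinary character:

```python
BOM_BE = b"\xfe\xff"  # U+FEFF serialized big-endian
BOM_LE = b"\xff\xfe"  # U+FEFF serialized little-endian


def parse_stream(data, default_byte_order="bigEndian"):
    """Model the BOM as an optional leading element, separate from the string.

    Returns (byte_order, text). The byte order is computed from the BOM
    element when present; otherwise the (assumed) default is used.
    """
    if data[:2] == BOM_BE:
        byte_order, body = "bigEndian", data[2:]
    elif data[:2] == BOM_LE:
        byte_order, body = "littleEndian", data[2:]
    else:
        byte_order, body = default_byte_order, data
    codec = "utf-16-be" if byte_order == "bigEndian" else "utf-16-le"
    # Within the string itself U+FEFF is never special: it decodes as an
    # ordinary character (zero-width non-breaking space).
    return byte_order, body.decode(codec)
```

For example, parse_stream(b"\xff\xfeh\x00i\x00") yields ("littleEndian",
"hi"), whereas a U+FEFF that appears after the first position is returned
as part of the text.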

.mike

  _____  

From: Steve Hanson [mailto:smh at uk.ibm.com] 
Sent: Wednesday, June 25, 2008 8:55 AM
To: mbeckerle.dfdl at gmail.com
Cc: dfdl-wg at ogf.org
Subject: Re: [DFDL-WG] Required encodings and testing

Some interesting and official stuff about BOMs here.
http://unicode.org/faq/utf_bom.html 

In IBM WMB we do see some XML UTF-16 data arriving with a BOM on the front
of a file/message, and we handle that. What we don't handle though is the
occurrence of a BOM part way through the file/message. I'm pretty sure it
would be treated as an ordinary code point. 

Regards, Steve

Steve Hanson
Programming Model Architect
WebSphere Message Brokers
Hursley, UK
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848 

From: "Mike Beckerle" <mbeckerle.dfdl at gmail.com>
Sent by: dfdl-wg-bounces at ogf.org
Date: 25/06/2008 12:37
Please respond to: mbeckerle.dfdl at gmail.com
To: Ian W Parkinson/UK/IBM at IBMGB, <dfdl-wg at ogf.org>
Subject: Re: [DFDL-WG] Required encodings and testing

I'm a little confused. I think the language is there in the spec: 

Spec v32 says: 

encoding 

Enum. 

Values are IANA charsets or CCSIDs [MJB1] [1].

This property can be computed by way of an expression which returns the
appropriate string. 

Note that there is, deliberately, no concept of 'native' encoding[2]. 

Conforming DFDL v1.0 processors must accept at least "UTF-8", "UTF-16",
"UTF-16BE", "UTF-16LE", "ASCII", and "ISO-8859-1" [MJB2] as encoding names.
Encoding names are case-insensitive, so "utf-8" and "UTF-8" are equivalent.
The "UTF-16" encoding requires that dfdl:byteOrder is defined. 

Annotation: dfdl:format
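
To make the quoted rules concrete, here is a minimal Python sketch of the
checks they imply. The function name check_encoding and the error messages
are assumptions for illustration; a real processor may of course accept
more names than this minimum set.

```python
REQUIRED_ENCODINGS = {
    "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE", "ASCII", "ISO-8859-1",
}


def check_encoding(name, byte_order=None):
    """Return the normalized encoding name, or raise ValueError."""
    normalized = name.upper()  # encoding names are case-insensitive
    if normalized not in REQUIRED_ENCODINGS:
        raise ValueError(f"{name!r} is not in the required minimum set")
    if normalized == "UTF-16" and byte_order is None:
        # Plain "UTF-16" does not fix a byte order by itself.
        raise ValueError('"UTF-16" requires dfdl:byteOrder to be defined')
    return normalized
```

With this sketch, check_encoding("utf-8") and check_encoding("UTF-8") give
the same result, and check_encoding("utf-16") without a byte order is
rejected.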

In the references it lists: 

IANA character set encoding names:
(http://www.iana.org/assignments/character-sets) 

I agree that neither the minimum list above in the box, nor the reference to
the IANA list, is sufficient. 

I did not find any "x-" character sets described here. 

Also, searching the IANA list I find no mention of BOM. So what list of
encodings are you referring to? 

I did find mention of UTF-16/UCS-2 requiring a BOM in the ICU. This may be a
usage pattern that ICU supports, and if I were coding up something hoping it
would be useful this might be what I would have done too; however, I have
not seen data with this behavior, so I have to question whether this is in
any way in use anywhere. Is it? 

While we're trimming options on encodings, with some web searching I wasn't
able to find the standard for CCSID other than at an IBM web site. So while
there is a CCSID for iso-8859-1 encoding, that doesn't mean CCSIDs are an
ISO standard, rather just that they have some conformant sets. 

Based on this I suggest dropping CCSID support since it is a vendor standard
only (If I'm correct.)  If this is, however, a de-facto standard even
outside of IBM context then I'll retract this suggestion. 

W.r.t. BOMs, I spent quite a lot of time on BOMs, mostly due to the hassle
that Unicode specifically says they are not characters; hence, I was
shooting for a semantics where a 10 character string could have either 10 or
11 codepoints in it due to a BOM being present or absent thereby turning
many fixed length things into variable length. Length determination gets
pretty complex if you do this. You have to look at quite a few properties
just to decide whether something is fixed or variable length. 
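
The length problem is easy to see with a small Python illustration (only a
sketch; the big-endian byte order is chosen arbitrarily). The same 10
character string occupies 10 or 11 16-bit units depending on whether a BOM
precedes it:

```python
text = "0123456789"  # 10 characters

without_bom = text.encode("utf-16-be")
with_bom = b"\xfe\xff" + without_bom  # same text preceded by a BOM

print(len(without_bom) // 2)  # prints 10
print(len(with_bom) // 2)     # prints 11
```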

The last proposal before we dropped BOMs altogether was to have a special
character set UTF-16-VL (for variable length) which means there may or may
not be a BOM. We concluded that this doesn't belong in DFDL. I do think the
right way to solve this BOM problem is with identification of encodings that
allow/require/prohibit use of BOMs: since a BOM is not a character, it must
be part of the character set encoding. E.g., UTF-16-BOM-required,
UTF-16-BOM-prohibited, UTF-16-BOM-allowed, etc. Somebody other than DFDL
should pick the names. The same issue comes up with UTF-16 with and
without the surrogate-pairs crud, i.e., do you want the number of 16-bit
code units, or do you want each surrogate pair considered to be one
character? We used to have a lengthUnits="fullUnicodeCharacters" to specify
this behavior. This has been dropped as too complex also. Again UTF-16-VL
was the last suggested way to fix this, i.e., VL for variable length meaning
interpret the BOMs, the surrogate pairs, etc. 
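
The two ways of measuring length can be shown with a short Python sketch
(the helper names are made up for the example). A character outside the
Basic Multilingual Plane takes two 16-bit units in UTF-16 but is a single
character when surrogate pairs are counted as one:

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def full_characters(text):
    """Number of Unicode code points; a surrogate pair counts once."""
    return len(text)


sample = "a\U0001D11E"  # 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF
print(utf16_code_units(sample))  # prints 3
print(full_characters(sample))   # prints 2
```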

One other issue of this kind is the weird variant of UTF-8 where surrogate
pairs are encoded as 3 bytes each rather than using the standard 4-byte
UTF-8 way of encoding a 20-bit character code. Again, this should be a new
character set encoding name, e.g., utf-8-encoded-surrogate-pairs. There's
also Java's funny UTF-8 variant where zero is encoded as 2 bytes. These
are all issues where there is a funny encoding but no standard IANA name for
it. 
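
For readers who have not met these variants, the following Python sketch
shows the byte-level difference. The function name and its two flags are
invented for the example; the surrogate-splitting form is the one usually
called CESU-8, and the two-byte zero is what Java calls modified UTF-8.

```python
def encode_utf8_variant(text, split_surrogates=True, two_byte_nul=False):
    """Encode text in a UTF-8 variant (illustrative only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0 and two_byte_nul:
            out += b"\xc0\x80"  # overlong two-byte form of U+0000
        elif cp >= 0x10000 and split_surrogates:
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" lets the codec emit the 3-byte form of
                # a lone surrogate, which strict UTF-8 would reject.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


clef = "\U0001D11E"
print(len(clef.encode("utf-8")))       # prints 4
print(len(encode_utf8_variant(clef)))  # prints 6
```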

If someone would like to co-author a suggestion for some new IANA charset
encoding names to propose to whomever that is, I would happily contribute. 

At this point, I'm pretty convinced that we should just say for DFDL v1 a
BOM is a codepoint and we treat it like any other codepoint. 

I also haven't seen any real use of BOMs. In memory people use native forms
and don't have these, and externally UTF-8 seems preferred. I'd like to hear
of real BOM usage examples. 

I also don't think we "have to" support them, in that a BOM can be treated
like an optional element that might or might not exist before a string.
Using a combination of valueCalc properties, defaults, and a calculated
value for the byteOrder property one can, I believe, achieve every
combination of optional or required BOM, and generate or omit the BOM on
output as the situation requires. It will be clumsy, but I prefer this to
putting a bunch of speculative features into the standard where we don't
really have a strong usage model in mind. 
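
On the output side, the idea amounts to something like the following Python
sketch (not DFDL; unparse_string and its emit_bom flag are hypothetical
stand-ins for whatever a schema would compute). The BOM is written as its
own element, in the computed byte order, only when the schema asks for it:

```python
def unparse_string(text, byte_order, emit_bom):
    """Serialize text as UTF-16, optionally preceded by a BOM element."""
    codec = "utf-16-be" if byte_order == "bigEndian" else "utf-16-le"
    # The BOM element is simply U+FEFF written in the chosen byte order.
    bom = "\ufeff".encode(codec) if emit_bom else b""
    return bom + text.encode(codec)
```

A required BOM corresponds to always passing emit_bom=True, a prohibited
one to always passing False, and an optional one to computing the flag from
other data.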

.mike 

  _____  

From: dfdl-wg-bounces at ogf.org [mailto:dfdl-wg-bounces at ogf.org] On Behalf Of
Ian W Parkinson
Sent: Tuesday, June 24, 2008 9:26 AM
To: dfdl-wg at ogf.org
Subject: Re: [DFDL-WG] Required encodings and testing 

Hi all, 

I'd suggest that we only need worry about those character sets described at
http://www.iana.org/assignments/character-sets. Are the ones beginning "x-"
specific to ICU? I think this would simplify the matter of BOMs somewhat, as
we wouldn't need to deal explicitly with character sets that must have a BOM
(presumably the -BOM variants) and so make the 'spec-twister' a non-issue. 

Unicode BOMs would remain a complex issue, though. If the schema specifies
encoding="UTF-16BE" or "UTF-16LE" then our behaviour is clear enough going
by the spec at http://www.ietf.org/rfc/rfc2781.txt - we never generate a
BOM, and any BOM encountered is treated as a character. If the schema
specifies just "UTF-16" (in which the BOM is strictly optional) then we'd
honour any BOM at the top of the text field, defaulting to the specified
dfdl:byteOrder value. On unparse we can choose whether or not to include a
BOM - I'd suggest we always include a BOM and use dfdl:byteOrder (*). If a
particular schema needs to control this more explicitly then they can use an
expression to compute UTF-16BE or UTF-16LE as appropriate. 
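
The behaviour proposed here can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the function names are
invented, and the sketch assumes that a BOM honoured under plain "UTF-16"
is consumed rather than returned as part of the text.

```python
CODECS = {"bigEndian": "utf-16-be", "littleEndian": "utf-16-le"}
BOMS = {b"\xfe\xff": "bigEndian", b"\xff\xfe": "littleEndian"}


def parse_text(data, encoding, byte_order=None):
    name = encoding.upper()
    if name == "UTF-16BE":
        return data.decode("utf-16-be")  # any U+FEFF stays in the text
    if name == "UTF-16LE":
        return data.decode("utf-16-le")  # any U+FEFF stays in the text
    if name == "UTF-16":
        detected = BOMS.get(data[:2])
        if detected is not None:
            return data[2:].decode(CODECS[detected])  # honour the BOM
        return data.decode(CODECS[byte_order])  # default: dfdl:byteOrder
    raise ValueError(f"unsupported encoding {encoding!r}")


def unparse_text(text, encoding, byte_order=None):
    name = encoding.upper()
    if name == "UTF-16BE":
        return text.encode("utf-16-be")  # never generate a BOM
    if name == "UTF-16LE":
        return text.encode("utf-16-le")  # never generate a BOM
    if name == "UTF-16":
        codec = CODECS[byte_order]
        return "\ufeff".encode(codec) + text.encode(codec)  # always a BOM
    raise ValueError(f"unsupported encoding {encoding!r}")
```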

That would leave the following edge-case: a schema which wants to generate
BOMless data so specifies (e.g.) UTF-16LE, but wants to tolerate and honour
any BOM present on parse. Do we need to deal with this unusual situation? It
perhaps could be handled through an optional hidden field, but would we want
to make it easier to achieve? 

(*) the alternative would be to leave the byte order up to the
implementation, potentially allowing data to be output with the endianness
in which it was received. This may be beneficial in some situations but
would leave the schema author without a way to specify the byteOrder while
still requiring a BOM to be generated. 

Cheers, 

Ian 

Ian Parkinson
WebSphere ESB Development
Mail Point 211, Hursley Park, Hursley, Winchester, SO21 2JN, UK 

From: "RPost" <rp0428 at pacbell.net>
To: <dfdl-wg at ogf.org>
Date: 24/06/2008 01:58
Subject: [DFDL-WG] Required encodings and testing

  _____  

Thanks for the response re encodings and issues. Very helpful. 

I put my responses in the attachment but here is the first part about
encoding. 

Your response: We haven't picked a basic set that all conforming
implementations must support other than that UTF-8 and USASCII must be
supported. We might require more than this though. 

That's a relief! 

The current spec mentions UTF-8, ebcdic-cp-us (IBM037), and UTF-16BE. 

Since Java 1.6 supports 160 encodings using 686 aliases I've no doubt you
see the reason for my question about which encodings need initial support. 

ICU supports even more encodings and requiring some of these could
implicitly require implementors to support ICU. Not an issue if that is
truly needed but that requirement alone could dissuade some from
participating in the project. 

The encodings I have examined/tested so far are: US-ASCII, ISO-8859-1,
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, IBM1047,
IBM500, IBM037, x-UTF-16LE-BOM, X-UTF-32BE-BOM, X-UTF-32LE-BOM. 

I have not run across any issues with any of the above encodings. 
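
A basic check of this kind can be as simple as an encode/decode round trip.
The Python sketch below is an illustration only, not the tests referred to
above: the codec names are Python's, not Java's or ICU's, and IBM1047 and
the x-...-BOM variants are left out because Python's standard library is
not assumed to provide them.

```python
ENCODINGS = [
    "ascii", "iso-8859-1", "utf-8",
    "utf-16", "utf-16-be", "utf-16-le",
    "utf-32", "utf-32-be", "utf-32-le",
    "cp500", "cp037",
]


def round_trips(text, encoding):
    """True if text survives an encode/decode round trip."""
    return text.encode(encoding).decode(encoding) == text


results = {name: round_trips("DFDL 2008", name) for name in ENCODINGS}
```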

ICU includes 175 UCM files of which 135 are for SBCS encodings. I have not
tested or examined all of these but would not expect them to be an issue
either. 

Also not examined are the 27 UCM files for MBCS encodings. A brief review
shows that many of these should not be an issue. 

BIG5 or GB18030 could definitely be an issue and there are several others
like these that might require a custom effort to support. Ok if you really
need it but better delayed initially if you don't. 

Glad to know we don't need to visit these for the short term. I'm sure
implementers would much rather concentrate on the DFDL aspect of things
rather than become encoding experts. I know I would. 
--
dfdl-wg mailing list
dfdl-wg at ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg

  _____  

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU 

  _____  

[1] CCSID stands for Coded Character Set ID, a 3 digit representation for a
codepage specifier. TBD: cite relevant standard for CCSIDs here. 

[2] The concept of native character encoding is avoided in DFDL since a DFDL
schema containing such a property binding does not contain a complete
description of data, but rather an incomplete one which is parameterized by
characteristics of the operating environment where the DFDL processor
executes. In DFDL this same behavior is achieved through use of true
parameterization, for example by use of Selectors to choose among
annotations specifying different character set encoding property bindings. 

  _____  

 [MJB1]Cite a standard for CCSID values in the footnote. 

 [MJB2]We want this to be as small as possible a set. Can we get away with
just UTF-8, 

Also TBD: what aliases of the IANA names are required? All of them? So,
e.g., "Latin1" is accepted?
