[DFDL-WG] Required encodings and testing

Ian W Parkinson PARKIW at uk.ibm.com
Tue Jun 24 08:26:06 CDT 2008


Hi all,

I'd suggest that we only need worry about those character sets described 
at http://www.iana.org/assignments/character-sets. Are the ones beginning 
"x-" specific to ICU? I think this would simplify the matter of BOMs 
somewhat, as we wouldn't need to deal explicitly with character sets that 
must have a BOM (presumably the -BOM variants) and so make the 
'spec-twister' a non-issue.

Unicode BOMs would remain a complex issue, though. If the schema specifies 
encoding="UTF-16BE" or "UTF-16LE" then our behaviour is clear enough going 
by the spec at http://www.ietf.org/rfc/rfc2781.txt - we never generate a 
BOM, and any BOM encountered is treated as a character. If the schema 
specifies just "UTF-16" (in which the BOM is strictly optional) then we'd 
honour any BOM at the top of the text field, falling back to the specified 
dfdl:byteOrder value when no BOM is present. On unparse we can choose 
whether or not to include a BOM - I'd suggest we always include a BOM and 
use dfdl:byteOrder (*). If a particular schema needs to control this more 
explicitly then its author can use an expression to compute UTF-16BE or 
UTF-16LE as appropriate.
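
To make the proposed rules concrete, here is a rough Python sketch of the 
parse and unparse behaviour described above. It is only an illustration, 
not a proposed implementation: the function names parse_utf16 and 
unparse_utf16 are made up for this example, and the byte_order values 
"bigEndian" / "littleEndian" are assumed spellings of dfdl:byteOrder.

```python
import codecs

BOM_BE = codecs.BOM_UTF16_BE  # b'\xfe\xff'
BOM_LE = codecs.BOM_UTF16_LE  # b'\xff\xfe'


def _codec_for(byte_order: str) -> str:
    # byte_order stands in for dfdl:byteOrder (assumed value spellings).
    return "utf-16-be" if byte_order == "bigEndian" else "utf-16-le"


def parse_utf16(data: bytes, encoding: str, byte_order: str = "bigEndian") -> str:
    """Hypothetical helper: decode a text field under the proposed rules."""
    enc = encoding.upper()
    if enc == "UTF-16BE":
        # Explicit endianness: a leading BOM is kept as the character U+FEFF.
        return data.decode("utf-16-be")
    if enc == "UTF-16LE":
        return data.decode("utf-16-le")
    if enc == "UTF-16":
        # Plain UTF-16: honour a BOM if present, else use dfdl:byteOrder.
        if data.startswith(BOM_BE):
            return data[2:].decode("utf-16-be")
        if data.startswith(BOM_LE):
            return data[2:].decode("utf-16-le")
        return data.decode(_codec_for(byte_order))
    raise ValueError("not a UTF-16 encoding: " + encoding)


def unparse_utf16(text: str, encoding: str, byte_order: str = "bigEndian") -> bytes:
    """Hypothetical helper: encode a text field under the proposed rules."""
    enc = encoding.upper()
    if enc == "UTF-16BE":
        return text.encode("utf-16-be")  # never generate a BOM
    if enc == "UTF-16LE":
        return text.encode("utf-16-le")  # never generate a BOM
    if enc == "UTF-16":
        # Suggested behaviour: always write a BOM, in dfdl:byteOrder order.
        bom = BOM_BE if byte_order == "bigEndian" else BOM_LE
        return bom + text.encode(_codec_for(byte_order))
    raise ValueError("not a UTF-16 encoding: " + encoding)
```

For example, unparse_utf16("A", "UTF-16", "littleEndian") writes the 
little-endian BOM followed by the two bytes for "A", whereas 
unparse_utf16("A", "UTF-16LE") writes just the two bytes.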

That would leave the following edge-case: a schema which wants to generate 
BOMless data so specifies (e.g.) UTF-16LE, but wants to tolerate and 
honour any BOM present on parse. Do we need to deal with this unusual 
situation? It perhaps could be handled through an optional hidden field, 
but would we want to make it easier to achieve?
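
For what it's worth, the behaviour such a schema would be asking for looks 
something like the Python sketch below. The names parse_tolerating_bom and 
unparse_bomless are hypothetical and the default of UTF-16LE is just the 
example from the paragraph above; this is not a suggestion for how DFDL 
should express it.

```python
import codecs


def parse_tolerating_bom(data: bytes, declared: str = "utf-16-le") -> str:
    """Hypothetical: decode as `declared`, but honour a leading BOM if present."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM: use the encoding named in the schema.
    return data.decode(declared)


def unparse_bomless(text: str, declared: str = "utf-16-le") -> bytes:
    """Hypothetical: always write BOMless data in the declared encoding."""
    return text.encode(declared)
```

One consequence worth noting: a field whose first real character is U+FEFF 
could not be distinguished from a BOM under this scheme, so that character 
would be dropped on parse.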


(*) the alternative would be to leave the byte order up to the 
implementation, potentially allowing data to be output with the endianness 
in which it was received. This may be beneficial in some situations but 
would leave the schema author without a way to specify the byteOrder while 
still requiring a BOM to be generated.


Cheers,

Ian

Ian Parkinson
WebSphere ESB Development
Mail Point 211, Hursley Park, Hursley, Winchester, SO21 2JN, UK



From: "RPost" <rp0428 at pacbell.net>
To: <dfdl-wg at ogf.org>
Date: 24/06/2008 01:58
Subject: [DFDL-WG] Required encodings and testing



Thanks for the response re encodings and issues. Very helpful.
 
I put my responses in the attachment but here is the first part about 
encoding.
 
Your response: We haven't picked a basic set that all conforming 
implementations must support other than that UTF-8 and USASCII must be 
supported. We might require more than this though.
 
That's a relief!
 
The current spec mentions UTF-8, ebcdic-cp-us (IBM037), and UTF-16BE.
 
Since Java 1.6 supports 160 encodings using 686 aliases I've no doubt you 
see the reason for my question about which encodings need initial support.
 
ICU supports even more encodings and requiring some of these could 
implicitly require implementors to support ICU. Not an issue if that is 
truly needed but that requirement alone could dissuade some from 
participating in the project.
 
The encodings I have examined/tested so far are: US-ASCII, ISO-8859-1, 
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, IBM1047, 
IBM500, IBM037, x-UTF-16LE-BOM, X-UTF-32BE-BOM, X-UTF-32LE-BOM.
 
I have not run across any issues with any of the above encodings.
 
ICU includes 175 UCM files of which 135 are for SBCS encodings. I have not 
tested or examined all of these but would not expect them to be an issue 
either.
 
Also not examined are the 27 UCM files for MBCS encodings. A brief review 
shows that many of these should not be an issue.
 
BIG5 or GB18030 could definitely be an issue and there are several others 
like these that might require a custom effort to support. Ok if you really 
need it but better delayed initially if you don't.
 
Glad to know we don't need to visit these for the short term. I'm sure 
implementers would much rather concentrate on the DFDL aspect of things 
rather than become encoding experts. I know I would.











-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: encodings_reply1.txt
Url: http://www.ogf.org/pipermail/dfdl-wg/attachments/20080624/23b63680/attachment.txt 

