[DFDL-WG] Required encodings and testing

Tue Jun 24 06:27:17 CDT 2008

Several people in the DFDL WG are hoping to use the ICU source code as part
of a DFDL implementation.

Some modifications will be necessary (number patterns are enhanced somewhat
in DFDL), but in general the hope is to not reinvent all the character set
encoding/decoding technology.

While it is true that DFDL does not want to require ICU implicitly from the
spec alone, the fact that ICU is there, is open source, has an
appropriate license allowing general use, and has comprehensive encoding
support largely removes the pressure to minimize encoding/decoding support
in DFDL or any other modern spec. It's not hard anymore to provide quite a
broad suite of encodings. The hardest part is a test suite that illustrates
correct use of each.

.mike

  _____  

From: dfdl-wg-bounces at ogf.org [mailto:dfdl-wg-bounces at ogf.org] On Behalf Of
RPost
Sent: Monday, June 23, 2008 9:03 PM
To: dfdl-wg at ogf.org
Subject: [DFDL-WG] Required encodings and testing

Thanks for the response re encodings and issues. Very helpful.

I put my responses in the attachment but here is the first part about
encoding.

Your response: We haven't picked a basic set that all conforming
implementations must support other than that UTF-8 and US-ASCII must be
supported. We might require more than this though.

That's a relief!

The current spec mentions UTF-8, ebcdic-cp-us (IBM037), and UTF-16BE.

Since Java 1.6 supports 160 encodings using 686 aliases, I've no doubt you
see the reason for my question about which encodings need initial support.
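
For anyone who wants to check the counts on their own JVM, here is a minimal
sketch using the standard java.nio.charset API. The class and method names
(CharsetInventory, countCharsets, countAliases) are just illustrative, and
the numbers printed will vary with the Java version and with which charset
providers are installed, so they need not match the Java 1.6 figures above.

```java
import java.nio.charset.Charset;

public class CharsetInventory {

    /** Number of charsets this JVM reports as available. */
    static int countCharsets() {
        return Charset.availableCharsets().size();
    }

    /** Total number of aliases summed over all available charsets. */
    static int countAliases() {
        int total = 0;
        for (Charset cs : Charset.availableCharsets().values()) {
            total += cs.aliases().size();
        }
        return total;
    }

    public static void main(String[] args) {
        // Counts depend on the JVM and installed charset providers.
        System.out.println(countCharsets() + " encodings using "
                + countAliases() + " aliases");
    }
}
```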

ICU supports even more encodings, and requiring some of these could
implicitly require implementers to depend on ICU. Not an issue if that is
truly needed, but that requirement alone could dissuade some from
participating in the project.

The encodings I have examined/tested so far are: US-ASCII, ISO-8859-1,
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, IBM1047,
IBM500, IBM037, x-UTF-16LE-BOM, X-UTF-32BE-BOM, X-UTF-32LE-BOM.

I have not run across any issues with any of the above encodings.
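
As a rough illustration of the kind of check involved (not the actual tests
I ran), the sketch below encodes a short ASCII-only sample with each of the
encodings listed above and decodes it again. The class name
EncodingRoundTrip, the helper methods, and the sample text are assumptions
made for the example. Encodings the running JVM does not provide are simply
skipped, and a round trip only shows that encoder and decoder agree with
each other, not that the bytes produced are the right ones.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class EncodingRoundTrip {

    // Names taken from the list in the message above.
    static final String[] NAMES = {
        "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE",
        "UTF-32", "UTF-32BE", "UTF-32LE", "IBM1047", "IBM500", "IBM037",
        "x-UTF-16LE-BOM", "X-UTF-32BE-BOM", "X-UTF-32LE-BOM"
    };

    /** True if encoding then decoding the text yields the same text. */
    static boolean roundTrips(Charset cs, String text) {
        byte[] encoded = text.getBytes(cs);
        return text.equals(new String(encoded, cs));
    }

    /** Names of supported encodings for which the round trip failed. */
    static List<String> failures(String text) {
        List<String> failed = new ArrayList<String>();
        for (String name : NAMES) {
            if (!Charset.isSupported(name)) {
                continue; // not provided by this JVM, so skip it
            }
            if (!roundTrips(Charset.forName(name), text)) {
                failed.add(name);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // ASCII-only sample so every listed encoding can represent it.
        System.out.println("Failed: " + failures("DFDL 123"));
    }
}
```

A real conformance suite would also compare the encoded bytes against known
good values for each encoding, which is the harder part.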

ICU includes 175 UCM files of which 135 are for SBCS encodings. I have not
tested or examined all of these but would not expect them to be an issue
either.

Also not examined are the 27 UCM files for MBCS encodings. A brief review
shows that many of these should not be an issue.

BIG5 or GB18030 could definitely be an issue, and there are several others
like these that might require a custom effort to support. That is OK if you
really need them, but better deferred if you don't.

Glad to know we don't need to visit these for the short term. I'm sure
implementers would much rather concentrate on the DFDL aspect of things
rather than become encoding experts. I know I would.
