[DFDL-WG] Required encodings and testing

Tue Jun 24 21:02:47 CDT 2008

Ian raises a couple of interesting issues.

1. Exactly how should encodings be specified?

I like the idea of using http://www.iana.org/assignments/character-sets if
possible.

Even here the ICU *.ucm files don't always use the 'Name' specified in the
standard, and the standard does not include the 'x-' names (experimental,
meaning not a standard).

Java mostly uses the standard 'Name' value but doesn't always match the ICU
name.

Even using the standard, the 'Name' itself may not be the preferred MIME
name.

            Name: Extended_UNIX_Code_Packed_Format_for_Japanese
            MIBenum: 18
            Alias: csEUCPkdFmtJapanese
            Alias: EUC-JP (preferred MIME Name)

Java supports the name and both aliases. ICU has several in their
'convrtrs.txt' file, including X-EUC-JP.

The standard says 'The MIBenum value is a unique value for use in MIBs to
identify coded character sets.'

Perhaps using the standard 'MIBenum' value for uniqueness and the 'Name' and
any or all of the aliases would work. I'm guessing that an appendix will
ultimately provide the list?
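
To make the MIBenum idea concrete, here is a minimal sketch in Python. The
names CharsetEntry, ENTRIES, BY_LABEL and resolve are hypothetical and not
part of DFDL, ICU or Java; the only data is the EUC-JP entry quoted above.
It assumes labels are compared case-insensitively, as the IANA registry
specifies.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CharsetEntry:
    mib_enum: int              # unique key from the IANA registry
    name: str                  # the registry 'Name'
    aliases: Tuple[str, ...]   # every registered 'Alias'
    preferred_mime: str        # alias flagged "(preferred MIME Name)"


# Only the one entry quoted above; a real table would cover the whole registry.
ENTRIES = [
    CharsetEntry(
        mib_enum=18,
        name="Extended_UNIX_Code_Packed_Format_for_Japanese",
        aliases=("csEUCPkdFmtJapanese", "EUC-JP"),
        preferred_mime="EUC-JP",
    ),
]

# Index every spelling, case-insensitively, to the unique MIBenum.
BY_LABEL: Dict[str, int] = {}
for entry in ENTRIES:
    for label in (entry.name, *entry.aliases):
        BY_LABEL[label.lower()] = entry.mib_enum


def resolve(label: str) -> int:
    """Return the MIBenum for a name or alias; raise KeyError if unknown."""
    return BY_LABEL[label.lower()]
```

With this table, resolve("euc-jp") and resolve("csEUCPkdFmtJapanese") both
yield 18, while a non-registry spelling such as "X-EUC-JP" is rejected
unless it is added explicitly.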

2. For DFDL, when there is a conflict, what is more important: adherence to
a standard, or enabling schema writers to express requirements unambiguously?

This line in the RFC 2781 link Ian provides caught my eye: '...addresses the
issues of serializing UTF-16 as an octet stream for transmission over the
Internet'.

Does this apply, since DFDL isn't really targeting 'transmission over the
Internet'? You can't transmit a binary file (whether it includes UTF-16 or
not) over the Internet without converting it first, often to BASE64.

So I'm not sure the standard for when to include/exclude BOMs applies. I
would suggest that it is critical to allow a schema writer to specify exactly
what to expect on parse and what is allowed on unparse.

As long as the writer can do that using a supported encoding and possibly
DFDL properties such as byteOrder, we're covered. You may very well need to
provide a way to explicitly specify whether a BOM 'is/is not/might be'
present and whether a BOM 'must be/can be' written on output.
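
As a rough illustration of what such explicit control could look like, here
is a Python sketch for UTF-16. BomOnParse, parse_utf16 and unparse_utf16 are
hypothetical names, not DFDL properties; byte_order stands in for byteOrder,
and the 'must be/can be' choice on output is collapsed to a simple boolean.

```python
import codecs
from enum import Enum


class BomOnParse(Enum):
    REQUIRED = "is"         # a BOM must be present
    FORBIDDEN = "is not"    # no BOM; a leading U+FEFF is ordinary data
    OPTIONAL = "might be"   # strip a BOM if one is there


_BOM = {"big": codecs.BOM_UTF16_BE, "little": codecs.BOM_UTF16_LE}
_CODEC = {"big": "utf-16-be", "little": "utf-16-le"}


def parse_utf16(data: bytes, byte_order: str, bom: BomOnParse) -> str:
    """Decode UTF-16 bytes under an explicit byte order and BOM policy."""
    mark = _BOM[byte_order]
    has_bom = data.startswith(mark)
    if bom is BomOnParse.REQUIRED and not has_bom:
        raise ValueError("BOM required but not found")
    if bom is not BomOnParse.FORBIDDEN and has_bom:
        data = data[len(mark):]          # drop the BOM before decoding
    # The explicit-endian codecs never strip a BOM themselves.
    return data.decode(_CODEC[byte_order])


def unparse_utf16(text: str, byte_order: str, write_bom: bool) -> bytes:
    """Encode text as UTF-16, writing a BOM only when asked to."""
    body = text.encode(_CODEC[byte_order])
    return _BOM[byte_order] + body if write_bom else body
```

The point of the sketch is only that each combination gives a predictable
result: the same input bytes decode differently under 'is not' than under
'might be', and the schema writer, not the codec, makes that choice.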

I didn't mention it in my original post, but the BOM issue is one of the
related issues that Addison Phillips ran into writing classes to serialize
text into fixed-width fields.

Namely: for a fixed-width field (width in bytes), how do you determine how
many text characters of a specified encoding will fit into the field? He had
to take into account BOMs as well as ensuring that complete 'shift in - shift
out' sequences could be written without overflow.

Then he had a similar issue to figure out the padding.
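
One simple (and not especially efficient) way to approach the fitting
problem is sketched below in Python. This is not Addison Phillips' code;
fit_to_field is a hypothetical helper. It encodes each candidate prefix as a
complete string, so whatever the codec adds (a BOM, or escape sequences that
switch character sets and switch back) is counted against the width. It
assumes the caller supplies pad bytes that are valid in the target encoding.

```python
from typing import Tuple


def fit_to_field(text: str, encoding: str, width: int,
                 pad: bytes = b" ") -> Tuple[bytes, int]:
    """Return (field_bytes, chars_written) for a field of `width` bytes.

    Tries the longest prefix first and shortens it one character at a
    time until the complete encoding plus whole pad units fills the field.
    """
    for n in range(len(text), -1, -1):
        # Encoding the whole prefix includes any BOM or closing escape
        # sequence the codec emits for a complete string.
        encoded = text[:n].encode(encoding)
        room = width - len(encoded)
        if room >= 0 and room % len(pad) == 0:
            return encoded + pad * (room // len(pad)), n
    raise ValueError("field too narrow for this encoding and pad")
```

For example, with the "iso2022_jp" codec a 10-byte field holds only two
kanji, because the escape sequences at both ends take six of the ten bytes;
with "utf-16" the BOM takes two bytes before any character is written.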

In the consulting I do, my assumption is that data, legacy and otherwise,
doesn't necessarily obey the business rules that it is supposed to. I usually
have no problem proving it, even if the user insists otherwise. That is the
#1 problem I run into as an ETL consultant.

Fields are NULL that shouldn't be. A name field in one system is
VARCHAR2(30) and on another the same field is VARCHAR2(40). What do you do
with data moving from a '40' to a '30'?
So for DFDL to support the parsing/reading of old legacy data, possibly
because the original tools don't exist anymore, a schema writer has to be
able to explicitly control how the data is interpreted, whether it meets the
standards or not.

3. To BOM or not to BOM - that is the question.

Use Ian's proposal for doing the standard thing in the standard way. But
ensure that there is at least some way for a schema writer to control
explicitly what parse/unparse will do. I wouldn't be inclined to add anything
to the spec at this point without a specific use case that requires it.

Speaking of which - did anyone ever locate the links on your site where I
can find some of your use case descriptions or discussions?
