[DFDL-WG] Initial list of Required encodings for DFDL version 1

Mike Beckerle mbeckerle.dfdl at gmail.com
Mon Jun 23 08:17:14 CDT 2008


Responses are interspersed below.

 

Mike Beckerle | OGF DFDL WG Co-Chair | CTO | Oco, Inc.
Tel:  781-810-2100  | 504 Totten Pond Road, Waltham MA 02451 |
mbeckerle.dfdl at gmail.com

  _____  

From: dfdl-wg-bounces at ogf.org [mailto:dfdl-wg-bounces at ogf.org] On Behalf Of
RPost
Sent: Sunday, June 22, 2008 7:10 PM
To: dfdl-wg at ogf.org
Subject: [DFDL-WG] Initial list of Required encodings for DFDL version 1

 

Q - What is the current thinking for the character set encodings that MUST
be implemented by a conforming DFDL processor for version 1?

 

We haven't picked a basic set that all conforming implementations must
support, beyond requiring UTF-8 and US-ASCII. We might require more than
this, though.

 

I have been performing tests with length-prefixed strings and strings using
terminators to see what issues affect the ability to detect the boundaries
between strings and binary data or terminator strings that immediately
follow the string.

 

For length-prefixed strings you need to be able either to decode the byte
array and iterate over the string character by character, or to perform byte
counting using only the byte stream and the bit patterns in the bytes
themselves.
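For instance, a rough sketch of the decode-character-by-character approach,
assuming a length prefix measured in characters and using java.nio (the
method name is illustrative, and "characters" here means UTF-16 code units):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.CodingErrorAction;

    // Return how many bytes the first 'charCount' characters occupy,
    // starting at 'offset'.  The decoder stops as soon as the output
    // buffer is full, so it never consumes bytes beyond the string.
    static int byteLengthOfChars(byte[] data, int offset, int charCount, Charset cs) {
        CharsetDecoder dec = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data, offset, data.length - offset);
        CharBuffer out = CharBuffer.allocate(charCount);
        CoderResult result = dec.decode(in, out, true);
        if (result.isError() || out.hasRemaining()) {
            throw new IllegalArgumentException("bad or truncated data: " + result);
        }
        return in.position() - offset;   // bytes consumed by charCount characters
    }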

 

Issue #1 - It will not be trivial to create all of the test cases needed to
fully cover the corner cases for each encoding. Obviously, the fewer
encodings that have to be supported initially, the easier the
implementation.

 

I don't understand your concern here. Yes, there are a few cases to test.
E.g., length measured in bytes with a variable-width character set (like
UTF-8 or Shift-JIS) means the number of characters is <= the number of
bytes; length measured in characters with a variable-width character set
means the number of bytes is >= the number of characters. In all other cases
you need to know nothing beyond the character set width. These test cases
are easy to enumerate.
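Spelled out, the four combinations are roughly (my tabulation of the cases
above):

    length measured in | fixed-width charset     | variable-width charset
    bytes              | chars = bytes / width   | chars <= bytes (must scan or decode)
    characters         | bytes = chars * width   | bytes >= chars (must scan or decode)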

 

All of the above must be supported unless we allow a conforming DFDL
implementation to support only, say, single-byte US-ASCII.

 

Issue #2 - There is no current support for byte counting in Java or ICU. For
encodings that are pure single-byte or pure multi-byte, the end of the
string can be found by examining the byte stream itself without performing
character decoding. The available classes all convert entire buffers (or
series of buffers), and they also consume large amounts of the byte stream.

 

The above are not limitations we can consider. Yes, data format support is
inadequate in these systems. That's why we need a standard here, because it
is too hard for people to implement and they need the reassurance of a
standard in order to justify the investment.

 

For some encodings (e.g. UTF-8) an algorithmic process can examine byte
values and determine whether a character consumes 1, 2, or more bytes.
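For instance, a sketch of that byte-counting idea for UTF-8 (illustrative
only; a real parser would also validate the continuation bytes):

    // Number of bytes in the UTF-8 sequence that begins with lead byte 'lead'.
    static int utf8SequenceLength(byte lead) {
        int b = lead & 0xFF;
        if ((b & 0x80) == 0x00) return 1;   // 0xxxxxxx - ASCII
        if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;   // 11110xxx
        throw new IllegalArgumentException("not a UTF-8 lead byte: 0x" + Integer.toHexString(b));
    }

    // Walk forward over 'charCount' characters and report how many bytes they use.
    static int utf8ByteLength(byte[] data, int offset, int charCount) {
        int pos = offset;
        for (int i = 0; i < charCount; i++) {
            pos += utf8SequenceLength(data[pos]);
        }
        return pos - offset;
    }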

 

Still other encodings will need custom processes written either to decode
and iterate the string or to use a specially designed table to perform byte
counting.

 

This is true and I don't see this as a problem.

 

As with issue #1, the fewer encodings needing special handling that must be
supported initially, the fewer problems for implementers.

 

To me the minimum interesting set of encodings is utf-8, usascii,
ebcdic-cp-1, iso-8859-1, utf-16BE, utf-16LE. Without these there is a huge
amount you cannot do. 

 

It's unlikely we'll get DFDL through standardization without also including
the important international sets for both Europe (iso-8859-N for various N)
and Asia.

 

Issue #3 - some encodings have multiple possible byte representations for
the same character. If a terminator string is specified as 'END' in a DFDL
property, it must be converted to the proper encoding before searching for
it. The easiest way to do this is to encode it, convert the encoded value to
a byte array, and then search the input stream's bytes for a match. But the
binary file could contain bytes that express one encoding of a character
while the Java code searches for that character using another byte
representation.
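A rough sketch of that encode-then-search approach (names are illustrative,
and, as the reply below notes, a naive byte scan can report a false match
inside binary data):

    import java.nio.charset.Charset;

    // Encode the terminator in the element's charset and scan the raw bytes for it.
    static int findTerminator(byte[] data, int from, String terminator, Charset cs) {
        byte[] term = terminator.getBytes(cs);
        outer:
        for (int i = from; i <= data.length - term.length; i++) {
            for (int j = 0; j < term.length; j++) {
                if (data[i + j] != term[j]) {
                    continue outer;
                }
            }
            return i;   // offset of the first match
        }
        return -1;      // not found
    }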

 

Careful. You can't just search the data for a pattern as you may get a false
match on binary data. 

 

DFDL does handle the above issues with its character entities system. 

 

Q - Does the DFDL spec need to allow a terminator to be specified as a hex
byte array so that the exact byte sequence to search for can be specified?

 

Yes. "foo%x66;bar" looks for the hex byte 66 after the "o" and before the
"b". Note that in a 2 byte character encoding one must put two bytes in
here. The entity inserts only a single uninterpreted byte.
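For illustration, a sketch that expands entities of the form shown above
into the raw byte sequence to search for (this handles only the %xNN; form
used in this message, not the full DFDL entity syntax):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.Charset;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Encode literal text in the element's charset; emit each %xNN; entity
    // as one uninterpreted byte.
    static byte[] terminatorBytes(String spec, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Matcher m = Pattern.compile("%x([0-9A-Fa-f]{2});").matcher(spec);
        int last = 0;
        while (m.find()) {
            byte[] lit = spec.substring(last, m.start()).getBytes(cs);
            out.write(lit, 0, lit.length);
            out.write(Integer.parseInt(m.group(1), 16));   // the raw byte
            last = m.end();
        }
        byte[] tail = spec.substring(last).getBytes(cs);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }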

 

Issue #4 - If a string can be specified as using one encoding and a
terminator can use a different encoding ...

 

This kind of thing gets discussed sometimes. It is simply an ambiguous
concept unless there is some other way of knowing the length. If the
terminator you mention is actually the delimiter that determines the length
of the string, then this concept is broken. If the terminator is just more
data found after a, say, fixed-length string, then there is no problem here,
as the DFDL system would know when to change encodings.

 

... is it possible that the terminator byte sequence is also a valid string
byte sequence, even though the characters being represented are different? I
haven't been able to determine whether this can happen.
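One way to probe this empirically, assuming Shift_JIS is representative of
the multi-byte case (a sketch only):

    import java.nio.charset.Charset;

    // Is there a Shift_JIS double-byte character whose trail byte equals the
    // ASCII delimiter '|'?  If so, a byte-level search for '|' in such text
    // can hit a false match even though no '|' character is present.
    public class TrailByteClash {
        public static void main(String[] args) {
            Charset sjis = Charset.forName("Shift_JIS");
            for (char c = 0x80; c <= 0xFFFC; c++) {
                byte[] enc = String.valueOf(c).getBytes(sjis);
                if (enc.length == 2 && enc[1] == (byte) '|') {
                    System.out.printf("U+%04X encodes as %02X %02X -- trail byte is '|'%n",
                            (int) c, enc[0] & 0xFF, enc[1] & 0xFF);
                    return;
                }
            }
            System.out.println("no clash found");
        }
    }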

 

Q - Does the DFDL spec need to disallow different encodings for strings and
terminators for version 1? Or are you confident that this corner case is
unlikely to be an issue?

 

We have recently discussed something called "variable markup" which can
express all these corner cases. We've decided that separate encoding control
for delimiters is too obscure. We allow case-sensitivity control for
delimiters, but anything beyond that uses variable markup. 

 

I have been in contact with Addison Phillips, the current chair of the W3C
Internationalization Core WG, and he ran into many of the above issues when
implementing character set providers for webMethods (since acquired by
Software AG). He also referred me to a contact at ICU and I hope to hear
from them in the next week or two.

 

Meanwhile, any thoughts or suggestions you have on the above would be
appreciated.

 

While I am waiting for feedback from ICU and Addison, I am trying to
determine an effective way to set up an automated test harness that can
generate different combinations of strings, terminators, and encodings and
perform volume testing. Mike suggested using the test example he provided,
but it only showed one data string for input. That might be adequate for
simple tests, but because test cases may need to be shared by multiple test
XSD files, it may not be scalable for volume testing or for testing multiple
cases.
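A rough sketch of such a combination generator (all names and sample data
are illustrative, not part of any existing harness):

    import java.nio.charset.Charset;
    import java.util.List;

    public class EncodingTestGen {
        static final List<String> ENCODINGS =
                List.of("UTF-8", "US-ASCII", "ISO-8859-1", "UTF-16BE", "UTF-16LE");
        static final List<String> STRINGS =
                List.of("abc", "caf\u00E9", "\u65E5\u672C\u8A9E");
        static final List<String> TERMINATORS = List.of(";", "END", "\r\n");

        public static void main(String[] args) {
            for (String enc : ENCODINGS) {
                Charset cs = Charset.forName(enc);
                for (String s : STRINGS) {
                    for (String t : TERMINATORS) {
                        if (!cs.newEncoder().canEncode(s)) continue;  // skip unmappable combos
                        byte[] data = (s + t).getBytes(cs);
                        // feed 'data' plus a generated DFDL schema to the parser under test
                        System.out.printf("%s | %s | %s -> %d bytes%n", enc, s, t, data.length);
                    }
                }
            }
        }
    }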

 

Test cases need to be shared by multiple XSD files? Can you explain this? A
test case is a combination of data and schema, isn't it?

 

.mike

 

 


