[DFDL-WG] Infoset codepage

Mike Beckerle mbeckerle.dfdl at gmail.com
Tue May 5 08:09:07 CDT 2009


4. Infoset codepage and encoding 

The spec does not say what codepage and encoding are used for string fields. 



I wanted to comment on this. 


There are three choices here: 


1.	Unicode code points - we may need to preserve the mapping table (from
representation encoding to Unicode) as part of the infoset.
2.	"As encoded" code points - we must add the encoding to the infoset.
3.	Both

In favor of Unicode code points - simplicity. A minor issue is that some
mappings lose information, making perfect round-tripping of string
contents impossible.
E.g., EBCDIC has two different line-ending characters, both of which are
normally translated to the ASCII/Unicode linefeed. Hence, translating back
is ambiguous.
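
To make the round-trip problem concrete, here is a minimal Python sketch.
The mapping table is a hypothetical four-entry fragment written for this
example, not any real codec's table; it assumes a converter that folds both
EBCDIC NL (0x15) and EBCDIC LF (0x25) into U+000A. Real converters differ
on this point, and some keep the two characters distinct.

```python
# Hypothetical decode table fragment (an assumption for illustration):
# both EBCDIC line-ending bytes are folded into Unicode LINE FEED.
DECODE = {
    0x15: "\n",  # EBCDIC NL (new line)
    0x25: "\n",  # EBCDIC LF (line feed)
    0xC1: "A",
    0xC2: "B",
}


def decode(data: bytes) -> str:
    """Translate representation bytes to Unicode via the table above."""
    return "".join(DECODE[b] for b in data)


def bytes_for(ch: str) -> list[int]:
    """Return every representation byte that decodes to ch."""
    return sorted(b for b, c in DECODE.items() if c == ch)


record_nl = bytes([0xC1, 0x15, 0xC2])  # "A", NL, "B"
record_lf = bytes([0xC1, 0x25, 0xC2])  # "A", LF, "B"

# Two different representations collapse to the same infoset string...
same_string = decode(record_nl) == decode(record_lf)

# ...so unparsing "\n" has two candidate bytes and cannot pick the original
# one without extra information.
candidates = bytes_for("\n")
```

With only the decoded string in the infoset, nothing records which of the
two candidate bytes was in the original data.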
 
In favor of "as encoded" - simplicity. We just add an encoding attribute to
the string infoset object which returns the information that the
dfdl:encoding representation property contained. Note that the encoding
information is already available via the schema component associated
with the string, so there is some redundancy here. There is also the
question of whether one wants code points or raw access to the bytes.
E.g., if the encoding is UTF-8 or Shift_JIS, then each character takes up
one or more bytes. Do you want the bytes, the interpreted code points, or
both?
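
The bytes-versus-code-points distinction can be sketched in Python as
follows. The two-character sample string is an assumption chosen for the
example; the codec names are the ones Python's standard library uses.

```python
# Sample string: LATIN CAPITAL LETTER A followed by U+65E5 (a kanji).
text = "A\u65e5"

# Interpreted code points: one integer per character, independent of the
# encoding used in the data.
code_points = [ord(ch) for ch in text]

# Raw bytes: the count depends on the representation encoding.
utf8_bytes = text.encode("utf-8")       # 1 byte for "A", 3 for U+65E5
sjis_bytes = text.encode("shift_jis")   # 1 byte for "A", 2 for U+65E5

for name, raw in (("UTF-8", utf8_bytes), ("Shift_JIS", sjis_bytes)):
    print(f"{name}: {len(text)} characters, {len(raw)} bytes, {raw.hex(' ')}")
```

The same two characters occupy a different number of bytes in each
encoding, which is why "as encoded" access has to say whether it means
bytes or code points.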
 
In favor of "both" - it eliminates all the ambiguity, at the cost of added
complexity.
 
My suggestion: keep it simple for v1.0 and choose number 1, because we can
always expand the capabilities later by providing access to the raw
(undecoded) representation one way or another. 
 
If you badly need infoset-level contents which expose the actual
representation character codes, you can always model this as an array of
bytes instead of a character string. 
 
...mike
 
Mike Beckerle | OGF DFDL WG Co-Chair | CTO | Oco, Inc.
Tel:  781-810-2125  | 100 Fifth Ave., 4th Floor, Waltham MA 02451 |
mbeckerle.dfdl at gmail.com 
