[DFDL-WG] Infoset codepage

Tue May 5 10:13:47 CDT 2009

Isn't choice 2 the most flexible? The caller can convert to what they 
need.

Alan Powell

 MP 211, IBM UK Labs, Hursley,  Winchester, SO21 2JN, England
 Notes Id: Alan Powell/UK/IBM     email: alan_powell at uk.ibm.com 
 Tel: +44 (0)1962 815073                  Fax: +44 (0)1962 816898

From:
DFDL <mbeckerle.dfdl at gmail.com>
To:
Steve Hanson/UK/IBM at IBMGB
Cc:
Alan Powell/UK/IBM at IBMGB, "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, 
"dfdl-wg-bounces at ogf.org" <dfdl-wg-bounces at ogf.org>
Date:
05/05/2009 15:35
Subject:
Re: [DFDL-WG] Infoset codepage

How about we specify Unicode codepoints, but allow implementations to 
limit the numeric range of codepoints they support? 

Reason: keeps us out of the codepoints vs. encodings morass. 
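
As a rough illustration of what such an implementation limit could look 
like, here is a minimal Python sketch. The limit of 0xFFFF (Basic 
Multilingual Plane only) and the function name are assumptions chosen for 
the example; nothing in this thread fixes a particular range.

```python
# Assumed limit: an implementation that only supports the Basic
# Multilingual Plane. The value is illustrative, not from the spec.
MAX_CODE_POINT = 0xFFFF

def unsupported_code_points(text: str, limit: int = MAX_CODE_POINT) -> list:
    """Return the code points (as hex strings) that exceed the limit."""
    return [hex(ord(ch)) for ch in text if ord(ch) > limit]

print(unsupported_code_points("abc"))
print(unsupported_code_points("A\U0001F600"))
```

The infoset is still described purely in terms of codepoints; an 
implementation would only need to document its supported range.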

...mikeb

On May 5, 2009, at 10:20 AM, Steve Hanson <smh at uk.ibm.com> wrote:

There is a 4th option - remain silent and leave it up to the 
implementation. 

Reason:  Within IBM we have different products that will embed a DFDL 
parser/unparser. WMB (WebSphere Message Broker) requires strings in 
UTF-16, but that is not always the case for others. 

Regards

Steve Hanson
Programming Model Architect
WebSphere Message Brokers
Hursley, UK
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848 

"Mike Beckerle" <mbeckerle.dfdl at gmail.com> 
Sent by: dfdl-wg-bounces at ogf.org 
05/05/2009 14:09 

Please respond to
mbeckerle.dfdl at gmail.com

To
Alan Powell/UK/IBM at IBMGB, <dfdl-wg at ogf.org> 
cc

Subject
[DFDL-WG] Infoset codepage

4. Infoset codepage and encoding 

The spec does not say what codepage and encoding are used for string 
fields. I wanted to comment on this. 
There are three choices here: 
1. Unicode codepoints - we may need to preserve the mapping table 
   (from representation encoding to Unicode) as part of the infoset. 
2. "As encoded" codepoints - we must add the encoding to the infoset. 
3. Both 
In favor of Unicode codepoints - simplicity. A minor issue is that some 
mappings lose information, making perfect round-tripping of string 
contents impossible. 
E.g., EBCDIC has two different line endings, both of which are normally 
translated to the ASCII/Unicode linefeed. Hence, translating back is 
ambiguous. 
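
To make the round-trip problem concrete, here is a minimal Python sketch. 
The small mapping table is a hand-written, hypothetical stand-in for a 
real EBCDIC-to-Unicode table (EBCDIC NL 0x15 and LF 0x25 both mapped to 
U+000A); real code page tables differ between implementations, and all 
names are illustrative only.

```python
# Hypothetical decode table: two distinct EBCDIC line-ending bytes
# both map to the single Unicode code point U+000A (line feed).
EBCDIC_NL = 0x15
EBCDIC_LF = 0x25
DECODE = {EBCDIC_NL: "\n", EBCDIC_LF: "\n", 0xC1: "A", 0xC2: "B"}

def decode(data: bytes) -> str:
    """Map each byte to its character using the table above."""
    return "".join(DECODE[b] for b in data)

def build_encode_candidates(decode_table: dict) -> dict:
    """Invert the table, collecting every byte that maps to each character."""
    inverse = {}
    for byte, ch in decode_table.items():
        inverse.setdefault(ch, []).append(byte)
    return inverse

original = bytes([0xC1, EBCDIC_NL, 0xC2, EBCDIC_LF])
text = decode(original)

# Characters with more than one candidate byte cannot be unparsed
# back to the original data without extra information.
ambiguous = {
    ch: sorted(candidates)
    for ch, candidates in build_encode_candidates(DECODE).items()
    if len(candidates) > 1
}

print(repr(text))
print(ambiguous)
```

After decoding, both line endings are the same character, so the infoset 
alone no longer says which byte to write when unparsing.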

In favor of "as encoded" - simplicity. We just add an encoding attribute 
to the string infoset object, which returns the information that the 
dfdl:encoding representation property contained. Note that the encoding 
information is already available via the schema component associated 
with the string, so there is some redundancy here. There is also the 
question of whether one wants codepoints or raw access to the bytes. 
E.g., if the encoding is UTF-8 or Shift JIS, then each character takes up 
one or more bytes. Do you want the bytes, the interpreted codepoints, or 
both? 
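
The difference between bytes and codepoints can be seen with a short 
Python sketch. The sample string is an arbitrary choice for illustration; 
the codec names are the ones Python's standard library uses.

```python
# One ASCII letter followed by two kanji (arbitrary sample text).
text = "A日本"

# The same three code points occupy a different number of bytes
# depending on the encoding.
code_points = [ord(ch) for ch in text]
utf8_bytes = text.encode("utf-8")
sjis_bytes = text.encode("shift_jis")

print([hex(cp) for cp in code_points])
print(len(utf8_bytes), utf8_bytes.hex())
print(len(sjis_bytes), sjis_bytes.hex())
```

The codepoint view is identical for both encodings, while the byte view 
differs in both length and content.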

In favor of "both" - complexity, but eliminates all the ambiguity. 
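
One possible shape for a "both" infoset string item is sketched below in 
Python. The class name InfosetString and its fields are hypothetical and 
are not taken from the spec; the sketch only assumes that the raw bytes 
and the dfdl:encoding value are kept, with the codepoints derived on 
demand.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InfosetString:
    """Hypothetical infoset string item exposing bytes and code points."""
    raw: bytes      # bytes exactly as they appeared in the data
    encoding: str   # value of the dfdl:encoding property, e.g. "utf-8"

    @property
    def value(self) -> str:
        # Unicode code points, decoded from the raw bytes on request.
        return self.raw.decode(self.encoding)

item = InfosetString(raw="héllo".encode("utf-8"), encoding="utf-8")
print(item.value, len(item.value), len(item.raw))
```

A caller that only wants characters reads value; a caller that needs the 
exact representation reads raw and encoding.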

My suggestion: keep it simple for v1.0 and choose option 1, because we can 
always expand the capabilities later by providing access to the undecoded 
representation one way or another. 

If you badly need infoset-level contents which expose the actual 
representation character codes, you can always model this as an array of 
bytes instead of a character string. 

...mike 
