[DFDL-WG] Infoset codepage

Tue May 5 09:35:01 CDT 2009

How about we specify Unicode codepoints, but allow implementations to
limit the numeric range of codepoints they support?

Reason: keeps us out of the codepoints vs. encodings morass.

...mikeb

On May 5, 2009, at 10:20 AM, Steve Hanson <smh at uk.ibm.com> wrote:

>
> There is a 4th option - remain silent and leave it up to the  
> implementation.
>
> Reason:  Within IBM we have different products that will embed DFDL  
> parser/unparser. WMB requires strings in UTF-16, that is not always  
> the case for others.
>
> Regards
>
> Steve Hanson
> Programming Model Architect
> WebSphere Message Brokers
> Hursley, UK
> Internet: smh at uk.ibm.com
> Phone (+44)/(0) 1962-815848
>
>
> "Mike Beckerle" <mbeckerle.dfdl at gmail.com>
> Sent by: dfdl-wg-bounces at ogf.org
> 05/05/2009 14:09
>
> Please respond to: mbeckerle.dfdl at gmail.com
> To: Alan Powell/UK/IBM at IBMGB, <dfdl-wg at ogf.org>
> cc:
> Subject: [DFDL-WG] Infoset codepage
>
> 4. Infoset codepage and encoding
>
> The spec does not say what codepage and encoding are used for string
> fields.
> I wanted to comment on this.
>
> There are three choices here:
> 1. Unicode codepoints - we may need to preserve the mapping
>    table (from representation encoding to Unicode) as part of the
>    infoset.
> 2. "As Encoded" codepoints - we must add the encoding to the
>    infoset.
> 3. Both
>
> In favor of unicode codepoints - simplicity. Minor issue is that  
> some mappings will lose information making perfect round-tripping of  
> string contents impossible.
> E.g., EBCDIC has two different line-endings both of which normally  
> are translated to ASCII/Unicode linefeed. Hence, translating back is  
> ambiguous.
>
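
To make the round-trip problem above concrete, here is a minimal sketch.
The DECODE table is hypothetical and only covers four bytes; it assumes a
converter that maps both EBCDIC line-ending bytes (NL = 0x15, LF = 0x25)
to Unicode linefeed. Real converters differ in how they treat these bytes.

```python
# Hypothetical decode table (an assumption for illustration, not a real
# converter): both EBCDIC line endings collapse to U+000A.
DECODE = {0x15: "\n", 0x25: "\n", 0xC1: "A", 0xC2: "B"}


def decode(data):
    """Map each representation byte to its Unicode character."""
    return "".join(DECODE[b] for b in data)


def inverse_candidates(ch):
    """Return every byte that could have produced this character."""
    return sorted(b for b, c in DECODE.items() if c == ch)


original = bytes([0xC1, 0x15, 0xC2, 0x25])
text = decode(original)

# "A" maps back to exactly one byte, but linefeed has two candidates,
# so the original bytes cannot be recovered from the text alone.
ambiguous = inverse_candidates("\n")
```

Preserving the mapping table in the infoset (as choice 1 suggests) would
not by itself resolve this, since the table is many-to-one for linefeed.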
> In favor of "as encoded" - simplicity. We just add an encoding  
> attribute to the string infoset object which returns the information  
> that the dfdl:encoding representation property contained. Note that  
> the encoding information really is already available via the schema  
> component associated with the string, so there is some redundancy  
> here. Also, there is the question of whether one wants codepoints
> or raw access to the bytes. E.g., if the encoding is UTF-8 or
> Shift JIS, then each character takes up one or more bytes. Do you
> want the bytes, the interpreted codepoints, or both?
>
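
A small sketch of the codepoints-versus-bytes distinction raised above.
The sample string is an arbitrary choice for illustration; the point is
that one sequence of codepoints has different byte lengths depending on
the encoding.

```python
# Sample string chosen for illustration: one ASCII letter and one kanji.
text = "a\u65e5"

# Interpreted codepoints: one integer per character.
code_points = [ord(ch) for ch in text]

# Raw bytes: length depends on the encoding.
utf8_bytes = text.encode("utf-8")       # 1 byte + 3 bytes
sjis_bytes = text.encode("shift_jis")   # 1 byte + 2 bytes

print(code_points, len(utf8_bytes), len(sjis_bytes))
```

An infoset holding only codepoints reports two characters in both cases;
the byte counts are visible only with access to the representation.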
> In favor of "both" - eliminates all the ambiguity, at the cost of complexity.
>
> My suggestion: keep it simple for v1.0 - Choose number 1 - because  
> we can always expand the capabilities later by providing access to  
> the unencoded representation one way or another.
>
> If you badly need infoset-level contents which expose the actual  
> representation character codes, you can always model this as an  
> array of bytes instead of a character string.
>
> ...mike
>
>
> Mike Beckerle | OGF DFDL WG Co-Chair | CTO | Oco, Inc.
> Tel:  781-810-2125  | 100 Fifth Ave., 4th Floor, Waltham MA 02451 | mbeckerle.dfdl at gmail.com 
>  --
>  dfdl-wg mailing list
>  dfdl-wg at ogf.org
>  http://www.ogf.org/mailman/listinfo/dfdl-wg
>