[DFDL-WG] Infoset codepage
DFDL
mbeckerle.dfdl at gmail.com
Tue May 5 09:35:01 CDT 2009
How about we specify Unicode codepoints, but allow implementations to
limit the numeric range of codepoints they support?
Reason: it keeps us out of the codepoints vs. encodings morass.
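
A minimal sketch of what such an implementation limit might look like, assuming a hypothetical implementation that only accepts codepoints in the Basic Multilingual Plane. The names BMP_MAX and check_supported are made up for illustration and come from no DFDL implementation:

```python
# Assumed limit: a hypothetical implementation that stores one 16-bit
# unit per character and therefore supports only U+0000..U+FFFF.
BMP_MAX = 0xFFFF


def check_supported(text: str, max_code_point: int = BMP_MAX) -> str:
    """Return text unchanged, or raise if any codepoint exceeds the limit."""
    for ch in text:
        if ord(ch) > max_code_point:
            raise ValueError(
                f"codepoint U+{ord(ch):04X} exceeds limit U+{max_code_point:04X}"
            )
    return text
```

The infoset itself would still be defined purely in terms of codepoints; only the accepted range varies per implementation.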
...mikeb
On May 5, 2009, at 10:20 AM, Steve Hanson <smh at uk.ibm.com> wrote:
>
> There is a 4th option - remain silent and leave it up to the
> implementation.
>
> Reason: Within IBM we have different products that will embed DFDL
> parser/unparser. WMB requires strings in UTF-16, that is not always
> the case for others.
>
> Regards
>
> Steve Hanson
> Programming Model Architect
> WebSphere Message Brokers
> Hursley, UK
> Internet: smh at uk.ibm.com
> Phone (+44)/(0) 1962-815848
>
>
> "Mike Beckerle" <mbeckerle.dfdl at gmail.com>
> Sent by: dfdl-wg-bounces at ogf.org
> 05/05/2009 14:09
> Please respond to: mbeckerle.dfdl at gmail.com
> To: Alan Powell/UK/IBM at IBMGB, <dfdl-wg at ogf.org>
> Subject: [DFDL-WG] Infoset codepage
>
> 4. Infoset codepage and encoding
>
> The spec does not say what codepage and encoding are used for string
> fields.
> I wanted to comment on this.
>
> There are three choices here:
> 1. Unicode codepoints - we may need to preserve the mapping
> table (from representation encoding to Unicode) as part of the
> infoset.
> 2. "As Encoded" codepoints - we must add the encoding to the
> infoset.
> 3. Both
> In favor of Unicode codepoints - simplicity. A minor issue is that
> some mappings lose information, making perfect round-tripping of
> string contents impossible.
> E.g., EBCDIC has two different line-endings both of which normally
> are translated to ASCII/Unicode linefeed. Hence, translating back is
> ambiguous.
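
The round-trip problem can be sketched as follows. The mapping table here is a hypothetical, deliberately tiny one that folds both EBCDIC line-end bytes (NL 0x15 and LF 0x25) onto U+000A; real code page tables differ, and some map NL to U+0085 instead, which avoids the loss:

```python
# Hypothetical four-entry mapping table, for illustration only.
EBCDIC_NL = 0x15
EBCDIC_LF = 0x25
TO_UNICODE = {
    EBCDIC_NL: "\n",  # both line-end bytes fold to U+000A
    EBCDIC_LF: "\n",
    0xC1: "A",
    0xC2: "B",
}


def decode(data: bytes) -> str:
    """Map each representation byte to its Unicode character."""
    return "".join(TO_UNICODE[b] for b in data)


def encode(text: str) -> bytes:
    """Invert the table; where two bytes share a character, the later entry wins."""
    from_unicode = {}
    for byte, ch in TO_UNICODE.items():
        from_unicode[ch] = byte
    return bytes(from_unicode[ch] for ch in text)


original = bytes([0xC1, EBCDIC_NL, 0xC2])
round_tripped = encode(decode(original))
# original and round_tripped differ: the NL byte comes back as LF.
```

Unless the infoset records which byte was actually seen, the unparser has to pick one of the two arbitrarily.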
>
> In favor of "as encoded" - simplicity. We just add an encoding
> attribute to the string infoset object which returns the information
> that the dfdl:encoding representation property contained. Note that
> the encoding information really is already available via the schema
> component associated with the string, so there is some redundancy
> here. Also, there's the issue of whether one
> wants codepoints or raw access to the bytes. E.g., if the encoding
> is UTF-8 or Shift JIS, then each character takes up 1 or more
> bytes. Do you want the bytes, the interpreted code points, or both?
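
To make the bytes vs. codepoints distinction concrete, here is a small sketch using Python's standard codecs. The helper name views is made up for illustration; the point is only that the same two-character string has one codepoint view and a different byte view per encoding:

```python
def views(text: str, encoding: str):
    """Return (codepoints, encoded bytes) for the same string."""
    code_points = [ord(ch) for ch in text]
    raw = text.encode(encoding)
    return code_points, raw


text = "A\u3042"  # LATIN CAPITAL LETTER A followed by HIRAGANA LETTER A

cps_utf8, utf8_bytes = views(text, "utf-8")
cps_sjis, sjis_bytes = views(text, "shift_jis")

# The codepoint view is identical in both cases (2 codepoints),
# but the byte view is 4 bytes in UTF-8 and 3 bytes in Shift JIS.
```

An "as encoded" infoset would have to say which of these views a string item exposes.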
>
> In favor of "both" - complexity, but eliminates all the ambiguity.
>
> My suggestion: keep it simple for v1.0 - Choose number 1 - because
> we can always expand the capabilities later by providing access to
> the unencoded representation one way or another.
>
> If you badly need infoset-level contents which expose the actual
> representation character codes, you can always model this as an
> array of bytes instead of a character string.
>
> ...mike
>
>
> Mike Beckerle | OGF DFDL WG Co-Chair | CTO | Oco, Inc.
> Tel: 781-810-2125 | 100 Fifth Ave., 4th Floor, Waltham MA 02451 | mbeckerle.dfdl at gmail.com