[DFDL-WG] Infoset codepage
Alan Powell
alan_powell at uk.ibm.com
Tue May 5 10:13:47 CDT 2009
Isn't choice 2 the most flexible? The caller can convert to what they
need.
Alan Powell
MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM email: alan_powell at uk.ibm.com
Tel: +44 (0)1962 815073 Fax: +44 (0)1962 816898
From: DFDL <mbeckerle.dfdl at gmail.com>
To: Steve Hanson/UK/IBM at IBMGB
Cc: Alan Powell/UK/IBM at IBMGB, "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>,
"dfdl-wg-bounces at ogf.org" <dfdl-wg-bounces at ogf.org>
Date: 05/05/2009 15:35
Subject: Re: [DFDL-WG] Infoset codepage
How about we specify Unicode code points, but allow implementations to
limit the numeric range of code points they support?
Reason: it keeps us out of the code points vs. encodings morass.
...mikeb
On May 5, 2009, at 10:20 AM, Steve Hanson <smh at uk.ibm.com> wrote:
There is a 4th option - remain silent and leave it up to the
implementation.
Reason: Within IBM we have different products that will embed a DFDL
parser/unparser. WMB requires strings in UTF-16, but that is not always
the case for the others.
Regards
Steve Hanson
Programming Model Architect
WebSphere Message Brokers
Hursley, UK
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848
"Mike Beckerle" <mbeckerle.dfdl at gmail.com>
Sent by: dfdl-wg-bounces at ogf.org
05/05/2009 14:09
Please respond to: mbeckerle.dfdl at gmail.com
To: Alan Powell/UK/IBM at IBMGB, <dfdl-wg at ogf.org>
cc:
Subject: [DFDL-WG] Infoset codepage
4. Infoset codepage and encoding
The spec does not say what codepage and encoding are used for string
fields.
I wanted to comment on this.
There are three choices here:
1. Unicode code points - we may need to preserve the mapping table
(from representation encoding to Unicode) as part of the infoset.
2. "As encoded" code points - we must add the encoding to the
infoset.
3. Both
In favor of Unicode code points - simplicity. A minor issue is that some
mappings will lose information, making perfect round-tripping of string
contents impossible.
E.g., EBCDIC has two different line endings, both of which are normally
translated to the ASCII/Unicode linefeed. Hence, translating back is
ambiguous.
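
To make the round-tripping problem above concrete, here is a minimal sketch. The table TO_UNICODE is a hand-built toy mapping, not any real codec's table; it assumes the two EBCDIC line-ending bytes are 0x15 (NL) and 0x25 (LF) and that both are translated to U+000A, as described above. The helper names are hypothetical.

```python
# Toy decode table (an assumption for illustration, not a real codec):
# both EBCDIC line-ending bytes collapse to U+000A LINE FEED.
TO_UNICODE = {
    0x15: 0x0A,  # EBCDIC NL
    0x25: 0x0A,  # EBCDIC LF
    0xC1: 0x41,  # EBCDIC 'A'
}

def decode(data):
    """Translate representation bytes to Unicode code points."""
    return [TO_UNICODE[b] for b in data]

def candidates(codepoint):
    """Return every representation byte that maps to this code point."""
    return sorted(b for b, cp in TO_UNICODE.items() if cp == codepoint)

with_nl = bytes([0xC1, 0x15])
with_lf = bytes([0xC1, 0x25])

# Two different inputs produce the same infoset string contents...
print(decode(with_nl) == decode(with_lf))   # True
# ...so unparsing U+000A has two possible answers.
print([hex(b) for b in candidates(0x0A)])   # ['0x15', '0x25']
```

Unless the infoset (or the mapping table kept with it) records which byte was seen, the unparser has to pick one of the candidates.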
In favor of "as encoded" - simplicity. We just add an encoding attribute
to the string infoset object which returns the information that the
dfdl:encoding representation property contained. Note that the encoding
information is already available via the schema component associated
with the string, so there is some redundancy here. There is also the
question of whether one wants code points or raw access to the bytes.
E.g., if the encoding is UTF-8 or Shift JIS, then each character takes
up one or more bytes. Do you want the bytes, the interpreted code points,
or both?
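
As a rough illustration of the bytes vs. code points question, the sketch below encodes the same three-character string in UTF-8 and in Shift JIS using Python's standard codecs. The sample string is an arbitrary choice for illustration.

```python
# 'A' followed by two kanji (U+65E5, U+672C); an arbitrary sample string.
text = "A\u65e5\u672c"

codepoints = [ord(ch) for ch in text]      # the interpreted code points
utf8_bytes = text.encode("utf-8")          # 1 + 3 + 3 bytes
sjis_bytes = text.encode("shift_jis")      # 1 + 2 + 2 bytes

# Same code points, different byte lengths per encoding.
print(len(codepoints), len(utf8_bytes), len(sjis_bytes))   # 3 7 5
```

The code points are identical in both cases; only the byte sequences differ, which is why an infoset that exposes bytes must also say which encoding produced them.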
In favor of "both" - complexity, but eliminates all the ambiguity.
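
One way to picture the "both" option is a string item that keeps the raw bytes and the encoding name and derives the code points on demand. This is only a sketch: StringItem and its fields are hypothetical names, not anything defined by the DFDL spec, and the encoding value is assumed to be usable as a Python codec name.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StringItem:
    """Hypothetical infoset string item exposing both views."""
    raw: bytes      # representation bytes as they appeared in the data
    encoding: str   # stand-in for the dfdl:encoding value (a codec name here)

    @property
    def codepoints(self):
        # Derived view: decode the bytes, then list Unicode code points.
        return [ord(ch) for ch in self.raw.decode(self.encoding)]

item = StringItem(raw=b"\xe9", encoding="latin-1")
print(item.codepoints)   # [233], i.e. U+00E9
print(item.raw)          # b'\xe9'
```

Keeping the bytes as the stored form avoids the lossy-mapping problem, at the cost of a larger and more complicated infoset item.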
My suggestion: keep it simple for v1.0 and choose option 1, because we
can always expand the capabilities later by providing access to the
unencoded representation one way or another.
If you badly need infoset-level contents which expose the actual
representation character codes, you can always model this as an array of
bytes instead of a character string.
...mike
Mike Beckerle | OGF DFDL WG Co-Chair | CTO | Oco, Inc.
Tel: 781-810-2125 | 100 Fifth Ave., 4th Floor, Waltham MA 02451 |
mbeckerle.dfdl at gmail.com
--
dfdl-wg mailing list
dfdl-wg at ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU