[DFDL-WG] 8-bit-ascii for dealing with binary data in text-like manner - problematic

Mike Beckerle mbeckerle.dfdl at gmail.com
Wed Feb 24 08:57:27 CST 2010


I think we've got a fix for this.

I found an official reference which has no "greyed out" codepoints: all 256
values are mapped.
The table on the Unicode FTP site (see URL below) officially defines the
mapping from 8859-1 to Unicode/ISO 10646.

The table includes all 256 codepoints. Some are specified as just <control>,
i.e., they have no specific meaning, but every 8859-1 codepoint maps
one-to-one and onto the Unicode/10646 codepoint with the same value.

Note that this property holds for 8859-1. It does not hold for 8859-2 to
8859-16, as these have character codes substituted into them that map to
other places in the iso10646 codepoint space.
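
A quick way to sanity-check this claim is to decode each byte value and compare
the resulting codepoint with the byte. The sketch below is illustrative only: it
uses Python's built-in codecs, which I am assuming agree with the unicode.org
mapping tables, and the helper name is_identity is made up for the example.

```python
def is_identity(encoding: str) -> bool:
    """Return True if every byte value N decodes to codepoint N."""
    for n in range(256):
        try:
            ch = bytes([n]).decode(encoding)
        except UnicodeDecodeError:
            # This byte has no mapping at all in the codec.
            return False
        if ord(ch) != n:
            # The byte maps somewhere else in the 10646 codepoint space.
            return False
    return True


identity_8859_1 = is_identity("iso-8859-1")
identity_8859_2 = is_identity("iso-8859-2")

# One concrete substitution in 8859-2: byte 0xA1 is LATIN CAPITAL LETTER
# A WITH OGONEK (U+0104), not U+00A1.
a1_in_8859_2 = ord(bytes([0xA1]).decode("iso-8859-2"))
```

Under that assumption, only the 8859-1 check should come back true.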

Here's the correspondence table:

ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

If we reference this mapping table in the references of the DFDL spec, then
I believe we can say that using encoding="iso-8859-1", you can treat binary
data as textual, use patterns, etc., and the relationship to/from the
infoset always ensures preservation of the values of the bytes (parsing),
and creation of bytes whose values exactly match the string codepoints
(unparsing).

This language can be added to the section on lengthKind="pattern" and binary
data:

Binary data can be handled using some of the conveniences of text by
treating it as text with encoding="iso-8859-1". In this case literal text,
such as length patterns, is interpreted in the iso-8859-1 character
encoding, and the correspondence of byte values in the data to a string in
the DFDL infoset is one to one. That is, a byte with value N produces an
infoset character with character code N.  [reference to above FTP site].
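
To make the intended behaviour concrete, here is a minimal sketch in Python (not
DFDL) of treating bytes as iso-8859-1 text, applying a pattern, and getting the
same bytes back. The sample bytes and the "everything up to the first NUL"
pattern are invented for illustration, and Python's codec and regex engine are
only stand-ins for what a DFDL processor would do.

```python
import re

# Arbitrary binary data, including values in the 00-1F and 7F-9F ranges.
raw = bytes([0x01, 0xFF, 0x80, 0x00, 0x41, 0x42, 0x00, 0x9F])

# "Parsing" direction: byte with value N becomes character with code N.
text = raw.decode("iso-8859-1")
codes = [ord(c) for c in text]

# A hypothetical length pattern: take everything before the first NUL byte.
match = re.match(r"[^\x00]*", text)
field = match.group(0)

# "Unparsing" direction: character with code N becomes byte with value N.
round_trip = text.encode("iso-8859-1")
field_bytes = field.encode("iso-8859-1")
```

Here the pattern is written as text, yet it selects the first three bytes
(01 FF 80) of the binary data, and encoding the string again reproduces the
original bytes exactly.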

On Thu, Feb 11, 2010 at 5:32 AM, Steve Hanson <smh at uk.ibm.com> wrote:

>
> Mike
>
> In the wikipedia entry for ISO 10646 it says "The system deliberately
> leaves many code points not assigned to characters, even in the BMP. It does
> this to allow for future expansion or to minimize conflicts with other
> encoding forms."  If those code points are below 256 then we have the same
> problem as 8859?  I can't find an actual map of the 10646 code points - you
> have to buy it from ISO.
>
> Regards
>
> Steve Hanson
> Programming Model Architect, WebSphere Message Broker,
> OGF DFDL WG Co-Chair,
> Hursley, UK,
> Internet: smh at uk.ibm.com,
> Phone (+44)/(0) 1962-815848
>
>
> From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To: dfdl-wg at ogf.org
> Date: 11/02/2010 00:38
> Subject: [DFDL-WG] 8-bit-ascii for dealing with binary data in text-like manner - problematic
> Sent by: dfdl-wg-bounces at ogf.org
>
> Every "8-bit-ascii" encoding I can find has holes in the code page. That
> is, values that don't have a corresponding character codepoint assigned.
>
> Example: iso-8859-X are a bunch of 8-bit ascii-based encodings that are
> popular.
>
> If you look up iso-8859-1 it has this language:
>
> Code values 00–1F, 7F–9F are not assigned to characters by ISO/IEC 8859-1.
>
> The lower range 20 to 7E (the G0 subset) maps exactly to the same coded G0
> subset of the ISO 646 US variant (commonly known as ASCII),
> ...
>
> They're saying 7-bit ascii is included, and some other codes are there, but
> they don't assign a codepoint generally.
>
> So, to me suggesting use of any particular code page for this purpose is
> somewhat ambiguous. E.g., what does &#x01; mean in a string if the encoding
> is iso-8859-1? There appears to be a set of translation tables that assign
> this to unicode in standard ways that one can find on the web. But the
> codepoint doesn't have an assigned meaning in iso-8859-X standards.
>
> Two possible clarifications:
> 1) for all ascii-based character sets, we say that bytes 0x00 to 0xFF all
> map to exactly those codepoints in ISO 10646 for the infoset, and vice
> versa.
>
> 2) define dfdl:encoding="bytes" as a special character set name which has
> the above property.
>
> Personally, I prefer 2. It is simpler to explain what is going on, and when
> people are depending on bytes it will be clearer that they are.
>
> ...mike