[DFDL-WG] 8-bit-ascii for dealing with binary data in text-like manner - problematic

Mike Beckerle mbeckerle.dfdl at gmail.com
Wed Feb 10 18:38:08 CST 2010


Every "8-bit-ascii" encoding I can find has holes in the code page. That is,
values that don't have a corresponding character codepoint assigned.

Example: the iso-8859-X family is a popular set of 8-bit, ascii-based
encodings.

If you look up iso-8859-1, it has this language:

Code values 00–1F, 7F–9F are not assigned to characters by ISO/IEC 8859-1.
The lower range 20 to 7E (the G0 subset) maps exactly to the same coded G0
subset of the ISO 646 US variant (commonly known as
ASCII<http://en.wikipedia.org/wiki/ASCII>),
...


They're saying the printable 7-bit ascii range (20 to 7E) is included, but
the ranges 00-1F and 7F-9F are not assigned to any character by the standard.

So, to me, suggesting the use of any particular code page for this purpose
is somewhat ambiguous. E.g., what does &#x01; mean in a string if the encoding
is iso-8859-1? There appears to be a set of translation tables that assign
this to unicode in standard ways that one can find on the web. But the
codepoint doesn't have an assigned meaning in iso-8859-X standards.
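
To make the ambiguity concrete, here is a small sketch that probes which byte
values a codec will decode. It only shows how Python's built-in codec tables
behave, which is an assumption about one implementation and says nothing
about what the ISO standards assign. The helper name decodable_bytes is
hypothetical, and cp1252 is brought in purely as a second 8-bit code page for
comparison; it is not mentioned above.

```python
# Illustration only: this probes Python's built-in codec tables, not the
# ISO/IEC 8859 standards and not any DFDL implementation.

def decodable_bytes(codec_name):
    """Return the set of byte values (0..255) that the named codec decodes."""
    ok = set()
    for value in range(256):
        try:
            bytes([value]).decode(codec_name)
            ok.add(value)
        except UnicodeDecodeError:
            # This codec's table has a hole at this byte value.
            pass
    return ok

# Python's iso-8859-1 table maps byte 0x01 to U+0001, following the common
# translation tables, even though the standard leaves 00-1F unassigned.
ctrl = b"\x01".decode("iso-8859-1")
print(hex(ord(ctrl)))

# Byte values each codec table refuses to decode.
latin1_holes = set(range(256)) - decodable_bytes("iso-8859-1")
cp1252_holes = set(range(256)) - decodable_bytes("cp1252")
print(sorted(latin1_holes))
print([hex(v) for v in sorted(cp1252_holes)])
```

In Python's tables iso-8859-1 has no holes at all, while cp1252 rejects a
few values such as 0x81. So whether a given byte "works" depends on whose
table you consult, not on the code page standard.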

Two possible clarifications:
1) for all ascii-based character sets, we say that byte values 0x00 to 0xFF
map to exactly the same-numbered codepoints (U+0000 to U+00FF) in ISO 10646
for the infoset, and vice versa.

2) define dfdl:encoding="bytes" as a special character set name which has
the above property.
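
The property both options rely on can be sketched as a pair of functions.
This is only an illustration under my own assumptions: the function names are
hypothetical, and nothing here is DFDL syntax or part of any DFDL processor.

```python
# A minimal sketch of the "bytes" mapping: byte value n <-> codepoint U+00nn.
# Function names are hypothetical, chosen only for this illustration.

def bytes_to_infoset(data: bytes) -> str:
    """Map every byte 0x00..0xFF to the same-numbered ISO 10646 codepoint."""
    return "".join(chr(b) for b in data)

def infoset_to_bytes(text: str) -> bytes:
    """Inverse mapping; codepoints above U+00FF have no byte representation."""
    for ch in text:
        if ord(ch) > 0xFF:
            raise ValueError(f"codepoint U+{ord(ch):04X} is outside 0x00..0xFF")
    return bytes(ord(ch) for ch in text)

# Every possible byte value survives a round trip through the infoset string.
all_bytes = bytes(range(256))
as_text = bytes_to_infoset(all_bytes)
round_tripped = infoset_to_bytes(as_text)
print(round_tripped == all_bytes)
```

With this mapping there are no holes: all 256 byte values have a defined
codepoint, and the only strings that cannot be written back out are those
containing codepoints above U+00FF.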

Personally, I prefer 2. It is simpler to explain what is going on, and when
people are depending on bytes it will be clearer that they are.

...mike