[DFDL-WG] Issue 156 - ICU fallback mappings - character encoding/decoding errors (version 3 - modified per call 2011-12-13)

Mike Beckerle mbeckerle.dfdl at gmail.com
Wed Dec 14 01:28:28 EST 2011


Revised to include better characterization of ICU behavior based on some
experiments I ran. Basically, ICU tolerates isolated surrogates in 16-bit
strings, but the UTF-8 conversion requires surrogates to be properly paired.

This is "standard" behavior: when encoding into UTF-8, any non-ISO 10646
character code or improperly paired surrogate is an error, and similarly
when decoding.
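
As a loose illustration only (this is Python's built-in codec, not ICU, and
Python strings hold code points rather than 16-bit units), the same split
between "tolerated in memory" and "rejected by UTF-8" can be sketched as
follows. The helper names are hypothetical.

```python
# A Python str tolerates an isolated surrogate in memory,
# much as ICU tolerates one in a 16-bit string.
lone = "\ud800"

def utf8_encode_fails(text):
    """Return True if strict UTF-8 encoding rejects the text."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False

def utf8_decode_fails(data):
    """Return True if strict UTF-8 decoding rejects the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

# ED A0 80 is what U+D800 would look like if surrogates were
# encoded as ordinary 3-byte UTF-8 sequences.
print(len(lone), utf8_encode_fails(lone), utf8_decode_fails(b"\xed\xa0\x80"))
```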

---------------------------------------------------------------------------

*Issue 156 - ICU fallback mappings - character encoding/decoding errors*

(modified per email thread on standardized ICU substitution/replacement
characters)

(Modified per workgroup discussion on 2011-12-06 - removed rationale and
discussion, simplified to just the minimum. Note a couple of important TBDs
in here: topics we forgot to discuss.)

(Modified per workgroup discussion on 2011-12-13 - did experiments to
answer the TBDs in here.)

*Summary*

DFDL currently does not have adequate capability to handle encoding and
decoding errors. Language in the spec is incorrect/infeasible to implement.
ICU provides mechanisms that give a degree of control over this issue; the
question is whether and how to embrace those mechanisms, or to provide some
alternative solution.

*Discussion*

This language in section 4.1.2 about character set decoding/encoding just
doesn't work:

This first part is unacceptable because it fails to specify what happens when
the decoding fails because of data errors.

*During parsing, characters whose value is unknown or unrepresentable in
ISO 10646 are replaced by the Unicode Replacement Character U+FFFD.*

This second part also is inadequate:

*During unparsing, characters that are unrepresentable in the target
encoding will be replaced by the replacement character for that encoding.*

This needs a citation for where these replacement characters are specified.
It also needs to specify what happens in certain error situations.

*Suggested Resolution: Summary*

   - DFDL property dfdl:encodingErrorPolicy with values 'skip', 'error',
   'replace'
   - Clarify that DFDL Infoset allows any 16-bit codepoint, not just those
   allowed by ISO 10646

*For Parsing/Decoding Errors*

There are two errors that can occur when decoding characters into
Unicode/ISO 10646.
1. The data is broken: invalid byte sequences that don't match the
   definition of the encoding are encountered.
2. Not enough bytes are found to make up the entire encoding of a
   character. That is, a fragment of a valid encoding is found.

The behavior in these cases is controlled by dfdl:encodingErrorPolicy.

If 'replace', then the Unicode *replacement character*
<http://en.wikipedia.org/wiki/Replacement_character> '�' (U+FFFD) is
substituted for the offending bytes, one replacement character for each
encoding error. This can be one per byte if there is a series of
all-illegal bytes, or it can be fewer replacement characters if a
multi-byte sequence encoding a character has an error in the later bytes.
For example, in UTF-8, if a 4-byte character has an error in the last byte
of the 4, then a single replacement character is created. Conversely, if
the first byte of a 4-byte character encoding has been corrupted, then one
might get as many as 4 replacement characters.

If 'skip' then the invalid byte sequences are dropped/ignored. No
corresponding characters are created in the DFDL infoset.

If 'error' then a processing error occurs.
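
The three policies can be tried out with Python's codec error handlers. This
is only a sketch: the policy-to-handler mapping and the function name are
assumptions made for illustration, and the number of replacement characters
ICU produces for a given corruption may differ from what Python's decoder
produces.

```python
# Assumed mapping from the proposed DFDL policy values to Python handlers.
POLICY_TO_HANDLER = {"replace": "replace", "skip": "ignore", "error": "strict"}

def decode_with_policy(data, encoding, policy):
    """Decode bytes using the Python handler standing in for a DFDL policy."""
    return data.decode(encoding, errors=POLICY_TO_HANDLER[policy])

good = "\U0001F600".encode("utf-8")   # a 4-byte sequence: F0 9F 98 80
bad_last = good[:3] + b"A"            # last byte of the 4 is wrong
bad_first = b"\xff" + good[1:]        # first byte of the 4 is corrupted

for label, data in (("bad_last", bad_last), ("bad_first", bad_first)):
    replaced = decode_with_policy(data, "utf-8", "replace")
    skipped = decode_with_policy(data, "utf-8", "skip")
    print(label, replaced.count("\ufffd"), repr(skipped))

try:
    decode_with_policy(bad_first, "utf-8", "error")
except UnicodeDecodeError as exc:
    print("processing error:", exc.reason)
```

With CPython 3.3 or later, the damaged-last-byte case yields one U+FFFD and
the damaged-first-byte case yields four, which matches the counts described
above.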

If a DFDL user wants to preserve data whose encodings contain these kinds
of errors, it is suggested that they model such data as xs:hexBinary, or as
xs:string using an encoding such as iso-8859-1, which preserves all bytes.
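
The byte-preserving property of iso-8859-1 is easy to check. The sketch
below uses Python's codec as a stand-in; an ICU-based implementation is
assumed, not shown, to behave the same way for this encoding.

```python
# Every one of the 256 byte values maps to a code point in iso-8859-1,
# so decoding can never fail and re-encoding restores the original bytes.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("iso-8859-1")
round_tripped = as_text.encode("iso-8859-1")

print(len(as_text), round_tripped == all_bytes)
```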

*Suggested Resolution - Unparsing/Encoding Errors*

The following kinds of errors can occur when encoding characters:
1. No mapping is provided by the encoding specification.
2. There is not enough room to output the entire encoding of the character
   (e.g., need 2 bytes for a DBCS, but only 1 byte remains in the available
   length).

The behavior in these cases is controlled by dfdl:encodingErrorPolicy.

If the policy is 'error' then a processing error occurs.

If the policy is 'skip' then the character is skipped. No character is
encoded to be output for case 1, and no partial character is attempted in
case 2.

If the policy is 'replace' then the behavior is determined by the encoding
specification.

Each encoding has a replacement/substitution character specified by ICU.
These can be found conveniently in the ICU Converter Explorer
<http://demo.icu-project.org/icu-bin/convexp>.
This character is substituted for the unmapped character or for the
character whose encoding is too large (errors 1 and 2 above).
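
For error case 1 (no mapping), Python's encoder error handlers give a rough
feel for the three policies. This is a sketch under assumptions: the mapping
and function name are hypothetical, and Python always substitutes '?' when
encoding, which is not necessarily the substitution character ICU defines
for a given encoding.

```python
# Assumed mapping from the proposed DFDL policy values to Python handlers.
POLICY_TO_HANDLER = {"replace": "replace", "skip": "ignore", "error": "strict"}

def encode_with_policy(text, encoding, policy):
    """Encode text using the Python handler standing in for a DFDL policy."""
    return text.encode(encoding, errors=POLICY_TO_HANDLER[policy])

text = "caf\u00e9"  # U+00E9 has no mapping in US-ASCII

print(encode_with_policy(text, "ascii", "replace"))
print(encode_with_policy(text, "ascii", "skip"))
try:
    encode_with_policy(text, "ascii", "error")
except UnicodeEncodeError as exc:
    print("processing error:", exc.reason)
```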

It is a processing error if it is not possible to output the replacement
character because there is not enough room for its representation. For
example, for UTF-8 encoding, the standard substitution character is
represented by 3 bytes. If there is no room for 3 bytes, then it is a
processing error.
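
The interaction between the available length and the substitution character
can be sketched as below. Everything here is hypothetical: the function
name, its parameters, the use of ValueError to stand in for a processing
error, and the choice to keep going after a skipped character are
assumptions for illustration, not part of the proposal.

```python
def encode_into(text, encoding, capacity, policy="error", substitute="\ufffd"):
    """Encode text into at most `capacity` bytes under a given policy.

    `substitute` stands in for the encoding's substitution character;
    the caller must supply one that the encoding can represent.
    """
    out = bytearray()
    for ch in text:
        try:
            encoded = ch.encode(encoding)
        except UnicodeEncodeError:
            encoded = None  # error case 1: no mapping in the target encoding
        # Error case 2: the encoded character does not fit in the remaining room.
        if encoded is None or len(out) + len(encoded) > capacity:
            if policy == "error":
                raise ValueError(f"cannot encode {ch!r}")
            if policy == "skip":
                continue  # nothing is output, not even a partial character
            encoded = substitute.encode(encoding)
            if len(out) + len(encoded) > capacity:
                raise ValueError("no room for the substitution character")
        out += encoded
    return bytes(out)

# U+FFFD occupies 3 bytes in UTF-8, so it cannot fit in the 2 bytes left here.
print(len("\ufffd".encode("utf-8")))
print(encode_into("ab\u20ac", "utf-8", 4, policy="skip"))
try:
    encode_into("ab\u20ac", "utf-8", 4, policy="replace")
except ValueError as exc:
    print("processing error:", exc)
```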

It is a processing error if a character encoding does not provide a
substitution/replacement character definition and one is needed because of
dfdl:encodingErrorPolicy='replace'. (This would be rare, but could occur if
a DFDL implementation allows many encodings beyond the minimum set.)