[DFDL-WG] 3.13 on encoding errors - rewording - was Re: Second draft of DFDL Errata v011

Tue Jan 15 12:32:38 EST 2013

New further simplified proposal for rewrite of errata 3.13.

Goal: allow simple implementations where a flag is simply set in the
decoder/encoder.

Dropped the problematic 'skip' option.

I also incorporated a point about UTF-16 errors from prior email on this
topic from Tim.

I added a point about how many Unicode Replacement Characters are inserted
when the data contains multiple adjacent errors. Basically, I made this
implementation dependent.

There are two TBDs in the text below that I'd like to resolve.

---------------------------------------------------------

Errata 3.13 (Revised)

A new sub-section is added to section 11. *(this is probably 11.2, if 11.1
is about Unicode byte order marks)*

11.2    Character Encoding and Decoding Errors

When parsing, these are the errors that can occur when decoding characters
into Unicode/ISO 10646.

1.    The data is broken - invalid bit/byte sequences are found which do
not match the definition of a character for the encoding.
2.    Not enough data is found to make up the entire encoding of a
character. That is, a fragment of a valid encoding is found.

When unparsing, these are the errors that can occur when encoding
characters from Unicode/ISO 10646 into the specified encoding.

1.    No mapping provided by the encoding specification.
2.    Not enough room to output the entire encoding of the character (e.g.,
the encoding of a character needs 3 bytes, but only 1 byte remains in the
available length).

The subsections below describe how these errors are handled.

11.2.1 property dfdl:encodingErrorPolicy

The property dfdl:encodingErrorPolicy has two possible values: 'error' and
'replace'.

11.2.1.1 dfdl:encodingErrorPolicy='error'

If 'error', then any error when decoding characters while parsing causes a
parse error. For unparsing, any error when encoding characters causes an
unparse error.

When parsing, it does not matter if this happens when scanning for
delimiters, matching a regular expression, matching a literal nil value, or
constructing the value of a textual element.
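
As a rough illustration of the 'error' policy (a sketch only, using Python's
built-in codecs as a stand-in for a DFDL processor's decoder; the helper
name decode_strict is hypothetical), the decoder is simply told to report
malformed input rather than substitute for it:

```python
def decode_strict(data: bytes, encoding: str = "utf-8") -> str:
    """Decode with errors reported, loosely analogous to
    dfdl:encodingErrorPolicy='error' (hypothetical helper)."""
    # errors="strict" makes the codec raise instead of substituting.
    return data.decode(encoding, errors="strict")


good = decode_strict(b"abc")

try:
    decode_strict(b"ab\xffc")  # 0xFF is never valid in UTF-8
    failed = False
except UnicodeDecodeError:
    # A DFDL processor would surface this as a parse error.
    failed = True
```

The same flag applies wherever the decoder is used, which is why the
context (delimiter scanning, regex matching, value construction) does not
matter.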

11.2.1.2 dfdl:encodingErrorPolicy='replace' for Parsing

If 'replace' then any error results in the insertion of the Unicode
Replacement Character (U+FFFD) as the replacement for that error.

It does not matter if this error and replacement happens when scanning for
delimiters, matching a regular expression, matching a literal nil value, or
constructing the value of a textual element.

Note, however, that unless a DFDL schema specifically uses the Unicode
Replacement Character in a delimiter or nil value, an inserted replacement
character is certain not to match one.
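
A minimal sketch of the 'replace' policy when parsing, again using Python's
codecs as a stand-in for a DFDL decoder (decode_replace and the
comma-delimited sample data are assumptions made for illustration):

```python
REPLACEMENT = "\ufffd"  # Unicode Replacement Character


def decode_replace(data: bytes, encoding: str = "utf-8") -> str:
    """Decode, substituting U+FFFD for malformed input (hypothetical helper)."""
    return data.decode(encoding, errors="replace")


# One invalid byte (0xFF) sits just before a ',' delimiter.
text = decode_replace(b"ab\xff,cd")

# The delimiter is still found; the replacement character is ordinary
# content of the first field and does not match ','.
fields = text.split(",")
```

Here the replacement character ends up inside the first field and the
delimiter scan proceeds as usual.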

*TBD: I believe we should disallow using the Unicode Replacement Character
in a delimiter, as a pad char, in a pattern, or as a feature character (text
number group separator, the text representation of NaN, boolean true,
calendarPattern, etc.). We could simply say it is an SDE if it
appears in any of these locations.*

Note that the "." wildcard in regular expressions will match the Unicode
Replacement Character, so ".*" and ".+" regular expressions can potentially
cause very large matches (up to the entire data stream) to occur when data
contains errors and dfdl:encodingErrorPolicy='replace'. Bounded length
regular expressions can help in this case. E.g., ".{0,50}" says to match
any character (including Unicode Replacement Characters), but only up to
length 50.
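
The difference between unbounded and bounded patterns can be sketched as
follows. Python's re module is used here only as a convenient stand-in for
the regular expression engine of a DFDL implementation, and the input is a
made-up string of already-decoded replacement characters:

```python
import re

# Pretend 200 undecodable bytes were each replaced by U+FFFD.
text = "\ufffd" * 200 + "tail"

# "." matches U+FFFD, so an unbounded pattern swallows everything.
unbounded = re.match(r".*", text)

# A bounded pattern stops after at most 50 characters.
bounded = re.match(r".{0,50}", text)

unbounded_len = len(unbounded.group())
bounded_len = len(bounded.group())
```

The bounded form caps the match at 50 characters regardless of how much
erroneous data follows.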

It is also worth noting that the Unicode Replacement Character can appear
in data as an ordinary character, and this cannot be distinguished from the
insertion of the Unicode Replacement Character due to a decode error.

If lengthUnits='characters', then a Unicode Replacement Character counts as
contributing a single character to the length.

If the data contains more than one adjacent decode error, then the specific
number of Unicode Replacement Characters that are inserted as the
replacement of these errors is implementation dependent. That is, some
implementations may view, for example, three consecutive erroneous bytes as
three separate decode errors, while others may view them as one or two
decode errors. All implementations MUST, however, insert some number of
Unicode Replacement Characters, and then continue to decode characters
following the erroneous data.
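
To see what one particular decoder does with adjacent errors, a sketch like
the following can be used. It inspects Python's UTF-8 codec, which is just
one data point; the count it produces is that codec's choice and not
something this text mandates (count_replacements is a hypothetical helper):

```python
def count_replacements(data: bytes, encoding: str = "utf-8"):
    """Return the decoded text and how many U+FFFD it contains."""
    decoded = data.decode(encoding, errors="replace")
    return decoded, decoded.count("\ufffd")


# Three consecutive bytes that are invalid in UTF-8, between 'A' and 'B'.
decoded, n = count_replacements(b"A\xff\xff\xffB")

# Whatever the count, decoding resumes after the bad bytes.
resumed = decoded.startswith("A") and decoded.endswith("B")
```

The only properties a schema author can rely on are that at least one
replacement character appears and that decoding continues afterwards.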

The trimming of padding characters always happens after Unicode Replacement
Characters have been inserted into the data.

11.2.1.3 dfdl:encodingErrorPolicy='replace' for Unparsing

For unparsing, each encoding has a replacement/substitution character
specified by the ICU. This character is substituted for the unmapped
character or the character that has too large an encoding to fit in the
available space.

The definitions of these substitution characters can be conveniently found
for many encodings in the ICU Converter Explorer (
http://demo.icu-project.org/icu-bin/convexp).
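
A small sketch of substitution on encoding. Note the assumption: this uses
Python's built-in codecs, whose 'replace' handler emits '?', which is not
necessarily the substitution character ICU defines for the same encoding.
It only illustrates the mechanism (encode_replace is a hypothetical
helper):

```python
def encode_replace(text: str, encoding: str) -> bytes:
    """Encode, substituting the codec's replacement for unmappable chars."""
    return text.encode(encoding, errors="replace")


# U+00E9 and U+20AC have no mapping in US-ASCII.
out = encode_replace("caf\u00e9 \u20ac", "ascii")
```

Each unmappable character is replaced independently; the mappable
characters around it are unaffected.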

An encoding error is an unparse error if the encoding does not provide a
substitution/replacement character definition. (This would be rare, but
could occur if a DFDL implementation allows many encodings beyond the
minimum set.)

*TBD: should we rule out this case by providing some default mapping that
can always be used? E.g., in the above corner case '?' would be used as the
substitution character.*

11.2.1.4 Parsing: The Not-Enough-Data Decode Error

There is one special case for the 'not enough data' decode error. It arises
for lengthUnits='bytes' when the encoding is a fixed-width character set (see
section 12.3.7.1.1 Character Width). If the number of bytes is not a
multiple of the character set width, then there will be some number of
bytes left over at the end of the data which are insufficient to hold an
entire character code. In this case no attempt is made to decode a
character from these left-over bytes. They are skipped when parsing (and
filled with the dfdl:fillByte on unparsing).
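
The left-over-bytes rule can be sketched as below. The helper
split_fixed_width is hypothetical, and UTF-32BE (4 bytes per character) is
chosen only as a convenient fixed-width example:

```python
def split_fixed_width(data: bytes, width: int):
    """Split data into whole-character bytes and left-over bytes."""
    usable = len(data) - (len(data) % width)
    return data[:usable], data[usable:]


# 10 bytes of UTF-32BE: two whole characters plus 2 left-over bytes.
raw = b"\x00\x00\x00A\x00\x00\x00B\x00\x00"
chars, leftover = split_fixed_width(raw, 4)

# Only the whole characters are decoded; the left-over bytes are skipped.
text = chars.decode("utf-32-be")
```

No decode error is raised for the trailing two bytes because no decode is
attempted on them.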

11.2.1.5  Parsing: Unicode Decoding Non-Errors

The following specific situations can arise with the encodings UTF-16,
UTF-16LE, and UTF-16BE when utf16Width="fixed". They do not cause a
decoding or encoding error.
•    an unpaired surrogate code point is encountered
•    an out-of-order surrogate code point pair is encountered
•    a surrogate code point pair is encountered

In all these cases each code point becomes a character code in the DFDL
Information Item for the string.
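
One way to picture fixed-width UTF-16 decoding is to treat every 16-bit
code unit as its own character, surrogate or not. The sketch below does
that for UTF-16BE; decode_utf16be_fixed is a hypothetical helper and is not
taken from any DFDL implementation:

```python
import struct


def decode_utf16be_fixed(data: bytes) -> str:
    """Map each big-endian 16-bit code unit to one character."""
    n = len(data) // 2
    units = struct.unpack(f">{n}H", data[: n * 2])
    # chr() accepts surrogate code points, so nothing is rejected here.
    return "".join(chr(u) for u in units)


# An unpaired high surrogate (0xD800) between 'A' and 'B'.
lone = decode_utf16be_fixed(b"\x00A\xd8\x00\x00B")

# A well-formed surrogate pair stays as two separate character codes.
pair = decode_utf16be_fixed(b"\xd8\x3d\xde\x00")
```

In both cases the string simply contains the surrogate code points, and
its length in characters equals the number of 16-bit code units.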

11.2.1.6 Unparsing: The Not-Enough-Room Encoding Error

There is one special case for the 'not enough room' encoding error. It arises
for lengthUnits='bytes' when the encoding is a fixed-width character set (see
section 12.3.7.1.1 Character Width). If the number of bytes is not a
multiple of the character set width, then there will be some number of
bytes of space left over at the end of the data which are insufficient to
hold an entire character code. In this case no attempt is made to encode a
character into these left-over bytes. They are filled with the
dfdl:fillByte. (On parsing they are skipped.)
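
The unparsing side can be sketched in the same spirit. The helper
unparse_fixed_width is hypothetical, and for simplicity it assumes the text
exactly fills the whole-character capacity, so padding is out of scope:

```python
def unparse_fixed_width(text: str, encoding: str, width: int,
                        length: int, fill_byte: int) -> bytes:
    """Encode text into `length` bytes, filling left-over bytes."""
    capacity = length - (length % width)  # bytes usable for whole chars
    encoded = text.encode(encoding)
    if len(encoded) != capacity:
        # Simplifying assumption of this sketch, not a DFDL rule.
        raise ValueError("text must exactly fill the whole-character capacity")
    # Left-over bytes cannot hold a character, so they get the fill byte.
    return encoded + bytes([fill_byte]) * (length - capacity)


# 10 bytes available, 4-byte characters: 2 characters plus 2 fill bytes.
out = unparse_fixed_width("AB", "utf-32-be", 4, 10, fill_byte=0x2A)
```

No encoding error is raised for the last two bytes because no character is
ever encoded into them.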

11.2.2    Preserving Data Containing Decoding Errors

There can be situations where data must be preserved exactly even if it
contains errors.

A DFDL schema author who wants to preserve data that may contain decoding
errors is advised to model such data as xs:hexBinary, or as xs:string
using an encoding such as iso-8859-1, which preserves all bytes.
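
The byte-preserving property of iso-8859-1 can be checked with a short
sketch (Python's codecs again stand in for a DFDL processor; the variable
names are illustrative only):

```python
# Every possible byte value, including ones invalid in UTF-8.
raw = bytes(range(256))

# iso-8859-1 maps each byte to exactly one character and back.
as_text = raw.decode("iso-8859-1")
round_tripped = as_text.encode("iso-8859-1")

# By contrast, decoding as UTF-8 with replacement loses information.
lossy = raw.decode("utf-8", errors="replace").encode("utf-8")
```

The iso-8859-1 round trip reproduces the input exactly, whereas the UTF-8
round trip with replacement does not.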

-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com