[DFDL-WG] 3.13 on encoding errors - rewording - was Re: Second draft of DFDL Errata v011

Wed Jan 16 11:40:29 EST 2013

Revised per call on 2013-01-16

TBDs resolved.

Language about lengthUnits='bytes' and fragment characters at the end
changed to drop requirement of fixed-width characters. Since these rules
are simpler now, I moved them directly into the primary sections describing
the functionality rather than using sub-sections at the end.

I will fold this into errata v11 (when I get it from Steve), to create
v11.1.

> ---------------------------------------------------------
>
> Errata 3.13 (Revised)
>
>
> A new sub-section is added to section 11. *(this is probably 11.2, if
> 11.1 is about Unicode byte order marks)*
>
> 11.2    Character Encoding and Decoding Errors
>
> When parsing, these are the errors that can occur when decoding characters
> into Unicode/ISO 10646.
>
> 1.    The data is broken - invalid bit/byte sequences are found which do
> not match the definition of a character for the encoding.
> 2.    Not enough data is found to make up the entire encoding of a
> character. That is, a fragment of a valid encoding is found.
>
> When unparsing, these are the errors that can occur when encoding
> characters from Unicode/ISO 10646 into the specified encoding.
>
> 1.    No mapping provided by the encoding specification.
> 2.    Not enough room to output the entire encoding of the character
> (e.g., a character needs 3 bytes in the encoding, but only 1 byte
> remains in the available length).
>
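As an illustration only (not proposed spec text), the four error cases
above can be sketched with Python's standard codecs standing in for a DFDL
processor. The function names and the 'room' parameter are hypothetical,
and UTF-8 is just an example encoding.

```python
import codecs

def classify_decode_error(data, encoding="utf-8"):
    """Return 'ok', 'invalid', or 'fragment' for a byte string."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        # final=False: an incomplete trailing sequence is buffered, not reported.
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return "invalid"      # parse case 1: broken bit/byte sequence
    try:
        # final=True: anything still buffered is an incomplete character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return "fragment"     # parse case 2: not enough data
    return "ok"

def classify_encode_error(text, encoding, room):
    """Return 'ok', 'no mapping', or 'not enough room' for a string."""
    try:
        encoded = text.encode(encoding)
    except UnicodeEncodeError:
        return "no mapping"   # unparse case 1
    # unparse case 2: the encoding exists but does not fit in 'room' bytes
    return "ok" if len(encoded) <= room else "not enough room"

print(classify_decode_error(b"ab\xe2\x82"))           # truncated 3-byte sequence
print(classify_decode_error(b"\xffab"))               # invalid start byte
print(classify_encode_error("\u20ac", "ascii", 10))   # no ASCII mapping
print(classify_encode_error("\u20ac", "utf-8", 1))    # needs 3 bytes, 1 available
```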
> The subsections below describe how these errors are handled.
>
> 11.2.1 property dfdl:encodingErrorPolicy
>
> The property dfdl:encodingErrorPolicy has two possible values: 'error' and
> 'replace'.
>
> 11.2.1.1 dfdl:encodingErrorPolicy='error'
>
> If 'error', then any error when decoding characters while parsing causes a
> parse error. For unparsing, any error when encoding characters causes an
> unparse error.
>
> When parsing, it does not matter if this happens when scanning for
> delimiters, matching a regular expression, matching a literal nil value, or
> constructing the value of a textual element.
>

There is one exception. When lengthUnits='bytes', the 'not enough data'
decode error is ignored, and the data making up the fragment character is
skipped over. Symmetrically, when unparsing, the 'not enough room' encoding
error is ignored and the left-over bytes are filled with the dfdl:fillByte.

>
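A rough sketch of that exception, again using Python codecs as a stand-in.
The names parse_fixed_bytes, unparse_fixed_bytes, and fill_byte are
hypothetical. I am also assuming that unparsing stops at the first
character that does not fit; the text above does not say this.

```python
import codecs

def parse_fixed_bytes(data, length, encoding="utf-8"):
    """Decode a field of 'length' bytes, skipping a trailing fragment."""
    field = data[:length]
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    # final=False leaves an incomplete trailing sequence in the decoder's
    # buffer instead of raising, so the fragment is silently skipped.
    # Invalid (not merely incomplete) bytes still raise UnicodeDecodeError.
    return decoder.decode(field, final=False)

def unparse_fixed_bytes(text, length, fill_byte=0x20, encoding="utf-8"):
    """Encode into exactly 'length' bytes, filling left-over room."""
    out = bytearray()
    for ch in text:
        encoded = ch.encode(encoding)   # strict: unmapped characters still raise
        if len(out) + len(encoded) > length:
            break                       # 'not enough room' is ignored (assumption: stop here)
        out += encoded
    out += bytes([fill_byte]) * (length - len(out))
    return bytes(out)

data = "ab\u20ac".encode("utf-8")       # 'a', 'b', then a 3-byte euro sign
print(parse_fixed_bytes(data, 4))        # field ends in the middle of the euro sign
print(unparse_fixed_bytes("ab\u20ac", 4, fill_byte=0x00))
```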
> 11.2.1.2 dfdl:encodingErrorPolicy='replace' for Parsing
>
> If 'replace' then any error results in the insertion of the Unicode
> Replacement Character (U+FFFD) as the replacement for that error.
>
> It does not matter if this error and replacement happens when scanning for
> delimiters, matching a regular expression, matching a literal nil value, or
> constructing the value of a textual element.
>

There is one exception. When lengthUnits='bytes', the 'not enough data'
decode error is ignored and no replacement character is created. The data
making up the fragment character is skipped over. (It will be filled with
the dfdl:fillByte when unparsing.)

The Unicode Replacement Character must not appear in any delimiter,
padCharacter, nilValue, regular expression, textNumberPattern, or in any
other property value or test pattern where the Unicode Replacement
Character would be expected in the data being parsed. It is a schema
definition error if the Unicode Replacement Character appears in any of
these locations of a DFDL schema, or is part of the value of an expression
that returns a string to be used as the value of a DFDL property.

> Note that the "." wildcard in regular expressions will match the Unicode
> Replacement Character, so ".*" and ".+" regular expressions can potentially
> cause very large matches (up to the entire data stream) to occur when data
> contains errors and dfdl:encodingErrorPolicy='replace'. Bounded length
> regular expressions can help in this case. E.g., ".{0,50}" says to match
> any character (including Unicode Replacement Characters), but only up to
> length 50.
>
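To make the regular expression point concrete, here is a small sketch using
Python's re module, with errors="replace" standing in for
dfdl:encodingErrorPolicy='replace'. The sample data is made up.

```python
import re

# Two invalid bytes followed by a long run of ordinary data.
data = b"ID=\xff\xfe broken payload " + b"x" * 200
text = data.decode("utf-8", errors="replace")   # bad bytes become U+FFFD

unbounded = re.match(r"ID=(.*)", text)        # "." also matches U+FFFD
bounded = re.match(r"ID=(.{0,50})", text)     # same, but capped at 50 characters

print(len(unbounded.group(1)))   # everything after "ID=" is matched
print(len(bounded.group(1)))     # at most 50
```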
> It is also worth noting that the Unicode Replacement Character can appear
> in data as an ordinary character, and this cannot be distinguished from the
> insertion of the Unicode Replacement Character due to a decode error.
>
> If lengthUnits='characters', then a Unicode Replacement Character counts
> as contributing a single character to the length.
>
> If the data contains more than one adjacent decode error, then the
> specific number of Unicode Replacement Characters that are inserted as the
> replacement of these errors is implementation dependent. That is, some
> implementations may view, for example, three consecutive erroneous bytes as
> three separate decode errors, while others may view them as one or two
> decode errors. All implementations MUST, however, insert some number of
> Unicode Replacement Characters, and then continue to decode characters
> following the erroneous data.
>
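A quick way to see what one particular decoder does with adjacent errors.
This shows only Python's UTF-8 codec; another implementation may insert a
different number of replacement characters, which is exactly why the count
is left implementation dependent. The helper name is hypothetical.

```python
REPLACEMENT = "\ufffd"

def count_replacements(data, encoding="utf-8"):
    """Count the U+FFFD characters this decoder inserts for 'data'."""
    return data.decode(encoding, errors="replace").count(REPLACEMENT)

bad = b"A\xff\xfe\xfdB"          # three consecutive erroneous bytes
decoded = bad.decode("utf-8", errors="replace")

# The count is decoder-specific; the surrounding 'A' and 'B' must survive.
print(count_replacements(bad), repr(decoded))
```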
> The trimming of padding characters always happens after Unicode
> Replacement Characters have been inserted into the data.
>
> 11.2.1.3 dfdl:encodingErrorPolicy='replace' for Unparsing
>
> For unparsing, each encoding has a replacement/substitution character
> specified by ICU. This character is substituted for an unmapped
> character, or for a character whose encoding is too large to fit in the
> available space.

There is one exception. When lengthUnits='bytes', the 'not enough room'
encoding error is ignored. The left-over bytes are filled with the
dfdl:fillByte (they are skipped when parsing).

>
> The definitions of these substitution characters can be conveniently found
> for many encodings in the ICU Converter Explorer (
> http://demo.icu-project.org/icu-bin/convexp).
>
> An encoding error is an unparse error if the encoding does not provide a
> substitution/replacement character definition. (This would be rare, but
> could occur if a DFDL implementation allows many encodings beyond the
> minimum set.)
>
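A sketch of the unparse side, under stated assumptions: the SUBSTITUTION
table, UnparseError, and encode_with_substitution are all hypothetical, and
the b"?" entries are placeholders, not the values ICU actually defines for
these encodings.

```python
# Placeholder table; a real implementation would take these from ICU data.
SUBSTITUTION = {"ascii": b"?", "iso-8859-1": b"?"}

class UnparseError(Exception):
    """Raised when a character cannot be encoded and no substitute exists."""

def encode_with_substitution(text, encoding):
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            substitute = SUBSTITUTION.get(encoding)
            if substitute is None:
                # No substitution character defined: this is an unparse error.
                raise UnparseError(f"no substitution character for {encoding}")
            out += substitute
    return bytes(out)

print(encode_with_substitution("a\u20acb", "ascii"))
```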
>
> 11.2.1.4  Parsing: Unicode Decoding Non-Errors
>
> The following specific situations involve encodings UTF-16, UTF-16LE,
> and UTF-16BE when utf16Width="fixed". They do not cause a decoding or
> encoding error.
> •    an unpaired surrogate code-point is encountered
> •    an out-of-order surrogate code-point pair is encountered
> •    a surrogate code-point pair is encountered
>
> In all these cases the code-point(s) becomes a character code in the DFDL
> Information Item for the string.
>
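One way to picture the fixed-width behavior is to treat every 16-bit code
unit as its own character code with no surrogate checking. This is only a
sketch; decode_utf16_fixed is a hypothetical helper, not anything from the
spec.

```python
import struct

def decode_utf16_fixed(data, big_endian=False):
    """Return one character code per 16-bit code unit, surrogates included."""
    count = len(data) // 2
    fmt = (">" if big_endian else "<") + f"{count}H"
    return list(struct.unpack(fmt, data[: count * 2]))

lone_high = b"\x00\xd8"                       # unpaired surrogate D800
out_of_order = b"\x00\xdc\x00\xd8"            # DC00 then D800
valid_pair = "\U0001F600".encode("utf-16-le") # a well-formed surrogate pair

for sample in (lone_high, out_of_order, valid_pair):
    print([hex(code) for code in decode_utf16_fixed(sample)])
```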
> 11.2.2    Preserving Data Containing Decoding Errors
>
>
> There can be situations where data must be preserved exactly even if
> it contains errors.
>
> It is suggested that a DFDL schema author who wants to preserve data that
> may contain decoding errors model such data as xs:hexBinary, or as
> xs:string with an encoding such as iso-8859-1, which preserves all
> bytes.
>
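The iso-8859-1 suggestion works because every byte value maps to exactly
one character. A short check in Python, contrasted with a lossy UTF-8
'replace' round trip:

```python
raw = bytes(range(256))                  # every possible byte value

as_text = raw.decode("iso-8859-1")       # never fails: one character per byte
round_trip = as_text.encode("iso-8859-1")

# Decoding the same bytes as UTF-8 with replacement loses the original bytes.
lossy = raw.decode("utf-8", errors="replace").encode("utf-8")

print(round_trip == raw)   # bytes preserved
print(lossy == raw)        # bytes not preserved
```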
>
> --
> Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
> www.tresys.com
>
>

-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com