[DFDL-WG] Issue 156 - ICU fallback mappings - character encoding/decoding errors (version 2 - modified per call 2011-12-06)

Thu Dec 15 08:10:57 EST 2011

I think you have identified the right choice for us right now: a stricter
UTF-8 as a required part of the standard, with the variants on UTF-8 left
up to implementations.

On Thu, Dec 15, 2011 at 6:05 AM, Tim Kimber <KIMBERT at uk.ibm.com> wrote:

> Actually your original wording is correct  - my memory was at fault. UTF-8
> can go up to 4 bytes. But the confusion in my mind was caused by a distant
> memory of the CESU 6-byte thing. Your questions below are valid ones.
>
> The CESU-8 question will naturally arise because DFDL offers the
> dfdl:utf16Width property. The implication of utf16Width is that DFDL
> recognises the fact that some applications do not distinguish between a
> UTF-16 code unit ( 16 bits ) and a UTF-16 character ( 16 or 32 bits ). The
> existence of the property implies that we want the decision to be an
> explicit decision taken by the modeller. I think that argues for strict
> serialization of UTF-8, with support for the ( non-Unicode ) CESU-8
> encoding being an optional feature in DFDL processors.
>
> regards,
>
> Tim Kimber, Common Transformation Team,
> Hursley, UK
> Internet:  kimbert at uk.ibm.com
> Tel. 01962-816742
> Internal tel. 246742
>
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        Tim Kimber/UK/IBM at IBMGB, Steve Hanson/UK/IBM at IBMGB,
> dfdl-wg at ogf.org
> Date:        14/12/2011 20:53
> Subject:        Re: [DFDL-WG] Issue 156 - ICU fallback mappings -
> character encoding/decoding errors (version 2 - modified per call
> 2011-12-06)
> ------------------------------
>
>
>
> Tim, do you think you were thinking of this encoding (CESU-8)
> <http://en.wikipedia.org/wiki/CESU-8> (pronounced "sez you") of surrogate
> pairs as two 3-byte UTF-8 sequences?
>
> I believe there is also this hack by which code point 0 is encoded as two
> bytes instead of just a 0. Not sure why this was needed, but it was a Java
> object-serialization convention.
>
> I was expecting that the ICU UTF-8 parser would deal with these, but it
> traps them as errors. Using the callback hook one could change it to handle
> them, or an encoding description that is more flexible could be created.
>
> On parsing, being able to accept everything possible seems good.
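
A rough illustration of the strict-versus-tolerant question, using Python's
built-in codecs as a stand-in. This is not ICU, and tolerant_decode is a
hypothetical helper, not part of any DFDL implementation. The strict UTF-8
decoder rejects both the CESU-8 surrogate pair and the two-byte NUL; a
tolerant parser could accept them by letting surrogates through and then
re-pairing them:

```python
# U+1F600 written as two 3-byte surrogate sequences (CESU-8 style).
CESU8_PAIR = bytes.fromhex("eda0bdedb880")
# Code point 0 written as two bytes (Java "modified UTF-8" convention).
JAVA_NUL = b"\xc0\x80"


def strict_ok(data: bytes) -> bool:
    """Return True if the strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def tolerant_decode(data: bytes) -> str:
    """Hypothetical tolerant parse: accept CESU-8 pairs and the 2-byte NUL."""
    # 0xC0 is never valid in strict UTF-8, so this rewrite is unambiguous.
    data = data.replace(JAVA_NUL, b"\x00")
    # Let 3-byte surrogate encodings through as surrogate code points.
    text = data.decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16 so adjacent surrogates become one character.
    utf16 = text.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le", "surrogatepass")


print(strict_ok(CESU8_PAIR), strict_ok(JAVA_NUL))
print(tolerant_decode(CESU8_PAIR) == "\U0001F600")
```

In this sketch tolerance is an explicit opt-in, and isolated surrogates
survive the round trip unchanged rather than being dropped.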
>
> The big concern is what to generate on unparse. E.g., for a floating
> surrogate, generate CESU-8 3-byte sequence? or error out/substitute? For a
> surrogate pair, generate two 3-byte CESU sequences for 6 byte total, or the
> UTF-8 standard 4-byte encoding?
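
To make the unparse choice concrete, here is a small Python sketch comparing
the two outputs for one supplementary character. cesu8_encode is a
hypothetical helper written for this illustration; as far as I know Python
has no built-in CESU-8 codec, so the helper encodes each UTF-16 code unit
separately.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical CESU-8 encoder: one UTF-8-style sequence per UTF-16 unit."""
    # surrogatepass lets isolated surrogates through instead of raising.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # Surrogate units become 3-byte sequences; other units are plain UTF-8.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


smiley = "\U0001F600"  # outside the BMP, so a surrogate pair in UTF-16

print(smiley.encode("utf-8").hex())  # standard UTF-8: 4 bytes
print(cesu8_encode(smiley).hex())    # CESU-8 style: 2 x 3 bytes
```

For characters inside the BMP the two encoders agree, so the choice only
matters for supplementary characters and isolated surrogates.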
>
> Or, perhaps we're just trying to squeeze too much into one encoding, and
> we actually need a strict and a tolerant variant of UTF-8? Like maybe
> people should say CESU-8 if that's what they mean?
>
> ...mikeb
>
>
>
> On Wed, Dec 14, 2011 at 5:29 AM, Tim Kimber <KIMBERT at uk.ibm.com> wrote:
> This is a little picky, but as the whole point is to tighten up the
> spec....
>
> UTF-8 characters should only ever be 1,2, or 3 bytes in length.
>
> In some applications a single Unicode character that is outside of the BMP
> ( so needs to be a surrogate pair in UTF-16 ) can end up as a pair of
> 2-byte UTF-8 characters. So the end result is 4 bytes of UTF-8 for a single
> Unicode character. But that's frowned upon by the Unicode consortium. The
> application should convert the single Unicode character to a single 3-byte
> UTF-8 character.
>
> regards,
>
> Tim Kimber, Common Transformation Team,
> Hursley, UK
> Internet:  kimbert at uk.ibm.com
> Tel. 01962-816742
> Internal tel. 246742
>
>
>
>
> From:        Steve Hanson/UK/IBM at IBMGB
> To:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> Cc:        dfdl-wg at ogf.org, Andreas Martens1/UK/IBM at IBMGB
> Date:        14/12/2011 07:45
> Subject:        Re: [DFDL-WG] Issue 156 - ICU fallback mappings -
> character encoding/decoding errors (version 2 - modified per call
> 2011-12-06)
> Sent by:        dfdl-wg-bounces at ogf.org
>  ------------------------------
>
>
>
>
> Mike, I think this proposal looks good and provides an adequate solution
> for DFDL 1.0. Let's discuss further on today's WG call.
>
> Regards
>
> Steve Hanson
> Architect, Data Format Description Language (DFDL)
> Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> IBM SWG, Hursley, UK
> smh at uk.ibm.com
> tel: +44-1962-815848
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        Steve Hanson/UK/IBM at IBMGB
> Cc:        Andreas Martens1/UK/IBM at IBMGB, dfdl-wg at ogf.org
> Date:        07/12/2011 15:02
> Subject:        Re: [DFDL-WG] Issue 156 - ICU fallback mappings -
> character encoding/decoding errors (version 2 - modified per call
> 2011-12-06)
>  ------------------------------
>
>
>
> Alright, I was able to convince myself that a substitution character is
> available, and associated with the IANA character set ID aliases. Even
> us-ascii has one (\x1A). E.g.,
> <http://demo.icu-project.org/icu-bin/convexp?conv=US-ASCII&s=ALL>
>
> So our original language that said to just use "the replacement character
> for the encoding" was actually correct!
>
> Revised proposal below. Basically, it's just an error, skip, or replace
> flag for the encoding error policy. We still have to figure out the TBDs in
> there with respect to how many substitutions/replacements will occur, and
> what to do about some of these Unicode-encoding related issues.
>
> ...mikeb
>
> ---------------------------------------------------------------------------
>
> *Issue 156 - ICU fallback mappings - character encoding/decoding errors*
>
> (Modified per email thread on standardized ICU substitution/replacement
> characters.)
> (Modified per workgroup discussion on 2011-12-06 - removed rationale and
> discussion, simplified to just the minimum. Note a couple of important TBDs
> in here. Topics we forgot to discuss.)
>
> *Summary*
>
> DFDL currently does not have adequate capability to handle encoding and
> decoding errors. Language in the spec is incorrect/infeasible to implement.
> ICU provides mechanisms giving a degree of control over this issue; the
> question is whether and how to embrace those mechanisms, or provide some
> other alternative solution.
>
> *Discussion*
>
> This language in section 4.1.2 about character set decoding/encoding just
> doesn't work:
>
> This first part is unacceptable because it fails to specify what happens
> when the decoding fails because of data errors.
>
> *During parsing, characters whose value is unknown or unrepresentable in
> ISO 10646 are replaced by the Unicode Replacement Character U+FFFD.*
>
> This second part also is inadequate:
>
> *During unparsing, characters that are unrepresentable in the target
> encoding will be replaced by the replacement character for that encoding.*
>
> This needs a citation for where these replacement characters are
> specified. It also needs to specify what happens in certain error
> situations.
>
> *Suggested Resolution: Summary*
>
>    - DFDL property dfdl:encodingErrorPolicy with values 'skip', 'error',
>    'replace'
>
> *For Parsing/Decoding Errors*
>
> There are two kinds of error that can occur when decoding characters into
> Unicode/ISO 10646.
> 1. The data is broken: invalid byte sequences that don't match the
> definition of the encoding are encountered.
> 2. Not enough bytes are found to make up the entire encoding of a
> character. That is, a fragment of a valid encoding is found.
>
> The behavior in these cases is controlled by dfdl:encodingErrorPolicy.
>
> If 'replace', then the Unicode replacement character
> <http://en.wikipedia.org/wiki/Replacement_character> '�' (U+FFFD) is
> substituted for the offending bytes, one replacement character for each
> invalid byte, one replacement character for any fragment of an encoding.
>
> (TBD: Should this say 'byte' or 'unit' ?? I.e., in UTF-16BE, will ICU
> error callback occur once for a broken codepoint, or once per byte?)
>
> (TBD: Assumptions to validate: I am assuming here that if there are 6
> invalid bytes, none of which can validly be unit 1 of the encoding of any
> character, that ICU will call the error hook either (a) 6 times, or (b)
> once but notifying about all 6 bad units - but providing a way for the
> hook-writer to say they want to substitute 6 characters for the 6 units.
>
> I am also assuming in the end-of-data fragment case that the ICU hook gets
> called once for the fragment, not once per byte of the fragment.)
>
> (TBD: We did not discuss on the call on Dec 6th the issue of errors in
> Unicode encodings. While there are no encodings where a properly encoded
> character is unmapped to Unicode, the Unicode UTF encodings themselves can
> contain things that are errors. Here's a short list of some things that
> can happen:
>
>    - utf-16 and unpaired surrogate code-point
>    - utf-16 and out-of-order surrogate code-point pair
>    - utf-8 parsing and 3-byte encoding of a surrogate code-point is found
>    - utf-8 unparsing and code-point of an isolated surrogate is to be
>    encoded.
>    - utf-8 decoding, and if you assemble the bits the usual way, you get
>    a code point out of range (higher than 0x10FFFF)
>    - utf-8 encoding, and code-point to encode is higher than 0x10FFFF.
>    - utf-16 encoding utf16Width='fixed' and a surrogate code point is
>    encountered
>    - utf-16 byte-order-marks found not at the beginning of the data
>
> We have an option here to be 'tolerant' of Unicode-encoding foibles. We
> can preserve isolated surrogates in a natural way if we wish. I believe
> many Unicode and UTF implementations tolerate these situations. For example
> the standard Java utf-8 decoder/encoder classes, InputStreamReader and
> OutputStreamWriter, are tolerant of incorrectly paired and isolated
> surrogate code points in the Java string data.
>
> I do not know what ICU does in these cases, i.e., if it provides us
> enough flexibility to do whatever we want, or if it doesn't even detect
> some of these things as errors.)
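
As a point of comparison only, here is a quick check of which of the cases
listed above a strict decoder/encoder reports as errors. This uses Python's
codecs, not ICU, and says nothing about what ICU does. The helper names are
made up for this sketch, and the byte-order-mark and utf16Width cases are
not covered.

```python
def rejects_decode(data: bytes, encoding: str) -> bool:
    """True if strict decoding of the bytes raises an error."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False


def rejects_encode(text: str, encoding: str) -> bool:
    """True if strict encoding of the text raises an error."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return True
    return False


cases = {
    # High surrogate D800 followed by an ordinary 'A'.
    "utf-16 unpaired surrogate": rejects_decode(b"\xd8\x00\x00\x41", "utf-16-be"),
    # Low surrogate DC00 before high surrogate D800.
    "utf-16 out-of-order pair": rejects_decode(b"\xdc\x00\xd8\x00", "utf-16-be"),
    # 3-byte UTF-8 form of the surrogate D800.
    "utf-8 3-byte surrogate": rejects_decode(b"\xed\xa0\x80", "utf-8"),
    # Isolated surrogate code point in the string being unparsed.
    "utf-8 isolated surrogate on encode": rejects_encode("\ud800", "utf-8"),
    # Bits assemble to 0x110000, one past the Unicode range.
    "utf-8 code point above 0x10FFFF": rejects_decode(b"\xf4\x90\x80\x80", "utf-8"),
}

for name, rejected in cases.items():
    print(f"{name}: {'error' if rejected else 'accepted'}")
```

A library that is tolerant of any of these would show "accepted" for that
row, which is the kind of difference the TBD above needs to pin down.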
>
>
> If 'skip' then the invalid byte sequences are dropped/ignored. No
> corresponding characters are created in the DFDL infoset.
>
> If 'error' then a processing error occurs.
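
The three policies on parse can be sketched with Python's standard error
handlers as an analogy. The mapping is an assumption for illustration:
'error' to strict, 'skip' to ignore, 'replace' to replace; parse_text is a
hypothetical name. How many replacement characters Python emits is Python's
behaviour and says nothing about ICU's callback granularity, which is still
a TBD above.

```python
# Assumed mapping from the proposed policy values to Python error handlers.
POLICY_TO_HANDLER = {"error": "strict", "skip": "ignore", "replace": "replace"}


def parse_text(data: bytes, encoding: str = "utf-8", policy: str = "error") -> str:
    """Decode bytes under one of the three proposed policies (sketch only)."""
    return data.decode(encoding, POLICY_TO_HANDLER[policy])


broken = b"ab\xff\xfecd"   # two bytes that can never start a UTF-8 sequence
fragment = b"ab\xe2\x82"   # first two bytes of a 3-byte sequence, then end of data

print(repr(parse_text(broken, policy="replace")))    # one U+FFFD per invalid byte
print(repr(parse_text(broken, policy="skip")))       # invalid bytes dropped
print(repr(parse_text(fragment, policy="replace")))  # one U+FFFD for the fragment

try:
    parse_text(broken, policy="error")
except UnicodeDecodeError as exc:
    print("processing error:", exc.reason)
```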
>
> It is suggested that if a DFDL user wants to preserve information
> containing data where the encodings have these kinds of errors, they
> model such data as xs:hexBinary, or as an xs:string using an encoding
> such as iso-8859-1, which preserves all bytes.
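
A minimal Python check of the byte-preserving property, assuming the
modeller picks iso-8859-1 as suggested: every byte value 0x00-0xFF maps to
exactly one character and back, so no decode error can occur and nothing is
lost.

```python
# Every possible byte value, including ones that are invalid as UTF-8.
raw = bytes(range(256))

# Strict decoding never fails: each byte becomes the code point of equal value.
as_text = raw.decode("iso-8859-1")

# Encoding the string again reproduces the original bytes exactly.
round_tripped = as_text.encode("iso-8859-1")

print(len(as_text), round_tripped == raw)
```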
>
> *Suggested Resolution - Unparsing/Encoding Errors*
>
> The following are kinds of errors when encoding characters:
> 1. No mapping is provided by the encoding specification.
> 2. There is not enough room to output the entire encoding of the
> character (e.g., need 2 bytes for a DBCS, but only 1 byte remains in the
> available length).
>
> The behavior in these cases is controlled by dfdl:encodingErrorPolicy.
>
> If the policy is 'error' then a processing error occurs.
>
> If the policy is 'skip' then the character is skipped. No character is
> encoded to be output for case 1, and no partial character is attempted in
> case 2.
>
> If the policy is 'replace' then the behavior is determined by the encoding
> specification.
>
> Each encoding has a replacement/substitution character specified by the
> ICU. These can be found conveniently in the ICU Converter Explorer
> <http://demo.icu-project.org/icu-bin/convexp>. This character is
> substituted for the unmapped character or the character that has too large
> an encoding (errors 1 and 2 above).
>
> It is a processing error if it is not possible to output the replacement
> character because there is not enough room for its representation.
>
> It is a processing error if a character encoding does not provide a
> substitution/replacement character definition and one is needed because of
> dfdl:encodingErrorPolicy='replace'. (This would be rare, but could occur if
> a DFDL implementation allows many encodings beyond the minimum set.)
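
A sketch of the three unparse policies for error case 1 (no mapping), again
in Python rather than ICU. The handler name dfdl-sub, the function unparse,
and the choice of 0x1A as the substitution byte (the us-ascii value
mentioned earlier in this thread) are assumptions for illustration. Case 2
(not enough room) is not modelled here.

```python
import codecs

# Substitution byte assumed for this sketch (us-ascii value cited in the thread).
SUBSTITUTION = b"\x1a"


def _substitute(err):
    """Error handler: emit one substitution byte per unmappable character."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    count = err.end - err.start
    # Returning bytes makes the encoder copy them straight into the output.
    return (SUBSTITUTION * count, err.end)


# "dfdl-sub" is a made-up handler name registered only for this example.
codecs.register_error("dfdl-sub", _substitute)

POLICY_TO_HANDLER = {"error": "strict", "skip": "ignore", "replace": "dfdl-sub"}


def unparse(text: str, encoding: str = "us-ascii", policy: str = "error") -> bytes:
    """Encode text under one of the three proposed policies (sketch only)."""
    return text.encode(encoding, POLICY_TO_HANDLER[policy])


print(unparse("caf\u00e9", policy="replace"))  # e-acute has no us-ascii mapping
print(unparse("caf\u00e9", policy="skip"))

try:
    unparse("caf\u00e9", policy="error")
except UnicodeEncodeError as exc:
    print("processing error:", exc.reason)
```

A real implementation would look the substitution character up per
encoding rather than hard-coding one value.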
>
>
>
>
>
>  ------------------------------
>
> *Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
> *
>
>
>
>
>  --
>  dfdl-wg mailing list
>  dfdl-wg at ogf.org
>  <http://www.ogf.org//mailman/listinfo/dfdl-wg>
>
> --
> Mike Beckerle | OGF DFDL WG Co-Chair
> Tel:  781-330-0412

-- 
Mike Beckerle | OGF DFDL WG Co-Chair
Tel:  781-330-0412