[DFDL-WG] Issue 156 - ICU fallback mappings - character encoding/decoding errors - UTF-8 minor errata needed

Thu Dec 15 10:07:35 EST 2011

OK - we should take a minor spec errata to add a note to say that UTF-8 in 
DFDL really does mean UTF-8 and not CESU-8 etc.

I will add this to the next WG call agenda as a separate item.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848

From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     Tim Kimber/UK/IBM at IBMGB
Cc:     dfdl-wg at ogf.org, Steve Hanson/UK/IBM at IBMGB
Date:   15/12/2011 13:11
Subject:        Re: [DFDL-WG] Issue 156 - ICU fallback mappings - 
character encoding/decoding errors (version 2 - modified per call 
2011-12-06)

I think you have identified the right choice for us right now, which is a 
stricter UTF-8 as required part of the standard now, with the variants on 
UTF-8 left up to implementations.

On Thu, Dec 15, 2011 at 6:05 AM, Tim Kimber <KIMBERT at uk.ibm.com> wrote:
Actually your original wording is correct  - my memory was at fault. UTF-8 
can go up to 4 bytes. But the confusion in my mind was caused by a distant 
memory of the CESU 6-byte thing. Your questions below are valid ones. 

The CESU-8 question will naturally arise because DFDL offers the 
dfdl:utf16width property. The implication of utf16width is that DFDL 
recognises the fact that some applications do not distinguish between a 
UTF-16 code point ( 16 bits ) and a UTF-16 character ( 16 or 32 bits ). 
The existence of the property implies that we want the decision to be an 
explicit decision taken by the modeller. I think that argues for strict 
serialization of UTF-8, with support for the ( non-Unicode ) CESU-8 
encoding being an optional feature in DFDL processors. 
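
A minimal Python sketch of the code-unit versus character distinction that 
utf16width is concerned with. It is not DFDL, and the helper name 
utf16_lengths is made up for illustration; it only counts code points and 
16-bit UTF-16 code units for the same string.

```python
def utf16_lengths(text):
    """Return (code points, UTF-16 code units) for text.

    Hypothetical helper for illustration only.
    """
    code_points = len(text)                          # Python strings count code points
    code_units = len(text.encode("utf-16-be")) // 2  # 2 bytes per 16-bit code unit
    return code_points, code_units


bmp_only = "A\u20ac"        # two characters, both inside the BMP
with_supp = "A\U00010400"   # 'A' plus one character outside the BMP

print(utf16_lengths(bmp_only))   # (2, 2)
print(utf16_lengths(with_supp))  # (2, 3): the second character is a surrogate pair
```

An application that treats every 16-bit unit as one character would report 
a length of 3 for the second string, where a code-point-aware one reports 2.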

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742  
Internal tel. 246742

From:        Mike Beckerle <mbeckerle.dfdl at gmail.com> 
To:        Tim Kimber/UK/IBM at IBMGB, Steve Hanson/UK/IBM at IBMGB, 
dfdl-wg at ogf.org 
Date:        14/12/2011 20:53 
Subject:        Re: [DFDL-WG] Issue 156 - ICU fallback mappings - 
character encoding/decoding errors (version 2 - modified per call 
2011-12-06) 

Tim, do you think you were thinking of this encoding (CESU-8) (pronounced 
"sez you") of surrogate pairs as 2 3-byte UTF-8 sequences?  

I believe there is also this hack by which code point 0 is encoded as two 
bytes instead of just a 0. Not sure why this was needed, but it was a Java 
object-serialization convention.
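
For concreteness, a small Python sketch of the byte sequences involved. 
Python has no built-in CESU-8 codec, so cesu8_encode below is a 
hypothetical helper written for this illustration; the two-byte NUL form is 
the one used by Java's "modified UTF-8". The strictness shown is that of 
Python's own UTF-8 codec, which may or may not match what ICU does.

```python
def cesu8_encode(text):
    """Encode text as CESU-8: supplementary characters become a surrogate
    pair, and each surrogate is written as its own 3-byte sequence.

    Hypothetical helper for illustration only.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


def strict_utf8_accepts(data):
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ch = "\U0001F600"              # a character outside the BMP
utf8 = ch.encode("utf-8")      # standard 4-byte form
cesu8 = cesu8_encode(ch)       # two 3-byte sequences, 6 bytes in total
modified_nul = b"\xc0\x80"     # two-byte encoding of code point 0

print(utf8.hex(), strict_utf8_accepts(utf8))            # f09f9880 True
print(cesu8.hex(), strict_utf8_accepts(cesu8))          # eda0bdedb880 False
print(modified_nul.hex(), strict_utf8_accepts(modified_nul))  # c080 False
```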

I was expecting that the ICU UTF-8 parser would deal with these, but it 
traps them as errors. Using the callback hook one could change it to 
handle them, or an encoding description that is more flexible could be 
created.

On parsing, being able to accept everything possible seems good. 

The big concern is what to generate on unparse. E.g., for a floating 
surrogate, generate CESU-8 3-byte sequence? or error out/substitute? For a 
surrogate pair, generate two 3-byte CESU sequences for 6 byte total, or 
the UTF-8 standard 4-byte encoding?

Or, perhaps we're just trying to squeeze too much into one encoding, and 
we actually need a strict and a tolerant variant of UTF-8? Like maybe 
people should say CESU-8 if that's what they mean?

...mikeb

On Wed, Dec 14, 2011 at 5:29 AM, Tim Kimber <KIMBERT at uk.ibm.com> wrote: 
This is a little picky, but as the whole point is to tighten up the 
spec.... 

UTF-8 characters should only ever be 1,2, or 3 bytes in length. 

In some applications a single Unicode character that is outside of the BMP 
( so needs to be a surrogate pair in UTF-16 ) can end up as a pair of 
2-byte UTF-8 characters. So the end result is 4 bytes of UTF-8 for a 
single Unicode character. But that's frowned upon by the Unicode 
consortium. The application should convert the single Unicode character to 
a single 3-byte UTF-8 character. 

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742  
Internal tel. 246742

From:        Steve Hanson/UK/IBM at IBMGB 
To:        Mike Beckerle <mbeckerle.dfdl at gmail.com> 
Cc:        dfdl-wg at ogf.org, Andreas Martens1/UK/IBM at IBMGB 
Date:        14/12/2011 07:45 
Subject:        Re: [DFDL-WG] Issue 156 - ICU fallback mappings - 
character encoding/decoding errors (version 2 - modified per call 
2011-12-06) 
Sent by:        dfdl-wg-bounces at ogf.org 

Mike, I think this proposal looks good and provides an adequate solution 
for DFDL 1.0. Let's discuss further on today's WG call. 

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 

From:        Mike Beckerle <mbeckerle.dfdl at gmail.com> 
To:        Steve Hanson/UK/IBM at IBMGB 
Cc:        Andreas Martens1/UK/IBM at IBMGB, dfdl-wg at ogf.org 
Date:        07/12/2011 15:02 
Subject:        Re: [DFDL-WG] Issue 156 - ICU fallback mappings - 
character encoding/decoding errors (version 2 - modified per call 
2011-12-06) 

Alright, I was able to convince myself that a substitution character is 
available, and associated with the IANA character set ID aliases. Even 
us-ascii has one (\x1A). E.g., 
http://demo.icu-project.org/icu-bin/convexp?conv=US-ASCII&s=ALL

So our original language that said to just use "the replacement character 
for the encoding" was actually correct!

Revised proposal below. Basically, it's just error, skip or replace flag 
for encoding error policy. We still have to figure out the TBDs in there 
with respect to how many substitution/replacements will occur, and what to 
do about some of these Unicode-encoding related issues.

...mikeb

---------------------------------------------------------------------------

Issue 156 - ICU fallback mappings - character encoding/decoding errors

(modified per email thread on standardized ICU substitution/replacement 
characters)
(Modified per workgroup discussion on 2011-12-06 - removed rationale and 
discussion, simplified to just the minimum. Note couple of important TBDs 
in here. Topics we forgot to discuss.)

Summary

DFDL currently does not have adequate capability to handle encoding and 
decoding errors. Language in the spec is incorrect/infeasible to 
implement. ICU provides mechanisms giving a degree of control over this 
issue; the question is whether and how to embrace those mechanisms, or 
provide some other alternative solution.

Discussion

This language in section 4.1.2 about character set decoding/encoding just 
doesn't work:

This first part is unacceptable because it fails to specify what happens 
when the decoding fails because of data errors. 

During parsing, characters whose value is unknown or unrepresentable in 
ISO 10646 are replaced by the Unicode Replacement Character U+FFFD. 

This second part also is inadequate:

During unparsing, characters that are unrepresentable in the target 
encoding will be replaced by the replacement character for that encoding. 

This needs a citation for where these replacement characters are 
specified. It also needs to specify what happens in certain error 
situations. 

Suggested Resolution: Summary

DFDL property dfdl:encodingErrorPolicy with values 'skip', 'error', 
'replace'.

For Parsing/Decoding Errors

There are two errors that can occur when decoding characters into 
Unicode/ISO 10646. 
1.        the data is broken - invalid byte sequences that don't match the 
definition of the encoding are encountered. 
2.        not enough bytes are found to make up the entire encoding of a 
character. That is, a fragment of a valid encoding is found. 

The behavior in these cases is controlled by dfdl:encodingErrorPolicy.

If 'replace', then the Unicode replacement character '�' (U+FFFD) is 
substituted for the offending bytes, one replacement character for each 
invalid byte, one replacement character for any fragment of an encoding.

(TBD: Should this say 'byte' or 'unit' ?? I.e., in UTF-16BE, will ICU 
error callback occur once for a broken codepoint, or once per byte?)

(TBD: Assumptions to validate: I am assuming here that if there are 6 
invalid bytes, none of which can validly be unit 1 of the encoding of any 
character, that ICU will call the error hook either (a) 6 times, or (b) 
once but notifying about all 6 bad units - but providing a way for the 
hook-writer to say they want to substitute 6 characters for the 6 units.

I am also assuming in the end-of-data fragment case that the ICU hook gets 
called once for the fragment, not once per byte of the fragment.)

(TBD: We did not discuss on the call on Dec 6th, the issue of errors in 
unicode encodings. While there are no encodings where a properly encoded 
character is unmapped to unicode, the unicode UTF encodings themselves can 
contain things that are errors. Here's a short list of some things that 
can happen: 
    utf-16 and unpaired surrogate code-point 
    utf-16 and out-of-order surrogate code-point pair 
    utf-8 parsing and 3-byte encoding of a surrogate code-point is found 
    utf-8 unparsing and code-point of an isolated surrogate is to be 
      encoded. 
    utf-8 decoding, and if you assemble the bits the usual way, you get a 
      code point out of range (higher than 0x10FFFF) 
    utf-8 encoding, and code-point to encode is higher than 0x10FFFF. 
    utf-16 encoding utf16Width='fixed' and a surrogate code point is 
      encountered 
    utf-16 byte-order-marks found not at the beginning of the data

We have an option here to be 'tolerant' of unicode-encoding foibles. We 
can preserve isolated surrogates in a natural way if we wish. I believe 
many Unicode and UTF implementations tolerate these situations. For 
example, the standard Java utf-8 decoder/encoder (InputStreamReader and 
OutputStreamWriter) is tolerant of incorrectly paired and isolated 
surrogate code points in the Java string data. 
I do not know what ICU does in these cases, i.e., if it provides us enough 
flexibility to do whatever we want, or if it doesn't even detect some of 
these things as errors.) 
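
Several of the situations listed in the TBD above can be reproduced with 
Python's built-in codecs, as a rough sketch. This shows only what Python 
does; it says nothing about ICU. The helpers decodes and encodes are made 
up for this illustration.

```python
def decodes(data, encoding, errors="strict"):
    """True if the bytes decode without raising."""
    try:
        data.decode(encoding, errors)
        return True
    except UnicodeDecodeError:
        return False


def encodes(text, encoding, errors="strict"):
    """True if the text encodes without raising."""
    try:
        text.encode(encoding, errors)
        return True
    except UnicodeEncodeError:
        return False


cases = {
    "utf-16 unpaired surrogate": decodes(b"\xd8\x00\x00\x41", "utf-16-be"),
    "utf-16 out-of-order pair": decodes(b"\xdc\x00\xd8\x00", "utf-16-be"),
    "utf-8 3-byte surrogate": decodes(b"\xed\xa0\x80", "utf-8"),
    "utf-8 isolated surrogate on encode": encodes("\ud800", "utf-8"),
    "utf-8 above 0x10FFFF": decodes(b"\xf4\x90\x80\x80", "utf-8"),
}
for name, accepted in cases.items():
    print(name, "accepted" if accepted else "rejected")

# A 'tolerant' mode exists as an explicit opt-in error handler.
tolerant_bytes = "\ud800".encode("utf-8", "surrogatepass")
print(tolerant_bytes.hex())  # eda080

# A byte-order mark in mid-stream is not an error here; it decodes as U+FEFF.
print(repr(b"\x00A\xfe\xff".decode("utf-16-be")))
```

In strict mode Python rejects all five cases in the dictionary, while the 
mid-stream byte-order mark is silently accepted, so even one implementation 
does not treat every item on the list as an error.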

If 'skip' then the invalid byte sequences are dropped/ignored. No 
corresponding characters are created in the DFDL infoset.

If 'error' then a processing error occurs.
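
As a rough analogy, the three policy values correspond closely to the 
error handlers of Python's built-in decoders. The sketch below assumes 
that mapping; decode_with_policy is a hypothetical name. How many 
replacement characters appear per run of bad bytes is a property of 
Python's codec here and may differ in ICU, which is exactly the TBD above.

```python
# Assumed mapping from the proposed policy values to Python error handlers.
_HANDLERS = {"error": "strict", "skip": "ignore", "replace": "replace"}


def decode_with_policy(data, encoding, policy):
    """Decode bytes, handling invalid input according to the policy."""
    return data.decode(encoding, errors=_HANDLERS[policy])


broken = b"ab\xffcd"        # 0xFF can never appear in UTF-8
truncated = b"ab\xe2\x82"   # first two bytes of a 3-byte sequence

print(repr(decode_with_policy(broken, "utf-8", "replace")))     # 'ab\ufffdcd'
print(repr(decode_with_policy(broken, "utf-8", "skip")))        # 'abcd'
print(repr(decode_with_policy(truncated, "utf-8", "replace")))  # 'ab\ufffd'

try:
    decode_with_policy(broken, "utf-8", "error")
except UnicodeDecodeError as exc:
    print("processing error:", exc.reason)
```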

It is suggested that if a DFDL user wants to preserve information 
containing data where the encodings have these kinds of errors, they 
should model such data as xs:hexBinary, or as an xs:string using an 
encoding such as iso-8859-1, which preserves all bytes.
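
The byte-preserving property of iso-8859-1 is easy to check. A short 
Python sketch (not DFDL) round-trips every possible byte value through a 
string and back:

```python
all_bytes = bytes(range(256))

# Every byte value maps to exactly one character and back again.
as_text = all_bytes.decode("iso-8859-1")
round_tripped = as_text.encode("iso-8859-1")

print(len(as_text))                # 256
print(round_tripped == all_bytes)  # True
```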

Suggested Resolution - Unparsing/Encoding Errors

The following are kinds of errors when encoding characters: 
1.        no mapping provided by the encoding specification. 
2.        not enough room to output the entire encoding of the character 
(e.g., need 2 bytes for a DBCS, but only 1 byte remains in the available 
length). 
The behavior in these cases is controlled by dfdl:encodingErrorPolicy.

If the policy is 'error' then a processing error occurs.

If the policy is 'skip' then the character is skipped. No character is 
encoded to be output for case 1, and no partial character is attempted in 
case 2.

If the policy is 'replace' then the behavior is determined by the encoding 
specification.

Each encoding has a replacement/substitution character specified by ICU. 
These can be found conveniently in the ICU Converter Explorer. This 
character is substituted for the unmapped character or the character that 
has too large an encoding (errors 1 and 2 above).

It is a processing error if it is not possible to output the replacement 
character because there is not enough room for its representation. 

It is a processing error if a character encoding does not provide a 
substitution/replacement character definition and one is needed because of 
dfdl:encodingErrorPolicy='replace'. (This would be rare, but could occur 
if a DFDL implementation allows many encodings beyond the minimum set.)
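
The unparsing rules above can be sketched in Python as follows. This is 
an illustration under stated assumptions, not an implementation of the 
proposal: encode_with_policy, max_bytes and substitute are hypothetical 
names, the encoding is assumed to be stateless so that characters can be 
encoded one at a time, and the default substitute of \x1A is the US-ASCII 
value mentioned earlier in this thread, used here for every encoding only 
for demonstration.

```python
def encode_with_policy(text, encoding, policy, max_bytes=None,
                       substitute=b"\x1a"):
    """Encode text character by character, applying the policy on errors.

    Hypothetical helper. Assumes a stateless encoding (no BOM or shift
    states). substitute=None models an encoding with no substitution
    character defined.
    """
    out = bytearray()
    for ch in text:
        try:
            encoded = ch.encode(encoding)
        except UnicodeEncodeError:
            encoded = None                  # case 1: no mapping
        fits = (encoded is not None and
                (max_bytes is None or len(out) + len(encoded) <= max_bytes))
        if fits:
            out += encoded
            continue
        # Here we have case 1 (unmapped) or case 2 (not enough room).
        if policy == "error":
            raise ValueError("encoding error at %r" % ch)
        if policy == "skip":
            continue
        if policy != "replace":
            raise ValueError("unknown policy %r" % policy)
        if substitute is None:
            raise ValueError("encoding has no substitution character")
        if max_bytes is not None and len(out) + len(substitute) > max_bytes:
            raise ValueError("no room for the substitution character")
        out += substitute
    return bytes(out)


# Case 1: U+00E9 has no mapping in US-ASCII.
print(encode_with_policy("caf\u00e9", "ascii", "replace"))  # b'caf\x1a'
print(encode_with_policy("caf\u00e9", "ascii", "skip"))     # b'caf'

# Case 2: a 2-byte Shift_JIS character with only 1 byte of room left.
print(encode_with_policy("a\u3042", "shift_jis", "skip", max_bytes=2))  # b'a'
```

One open point the sketch makes visible: after a character is skipped for 
lack of room, a later, shorter character could still fit. The proposal 
does not say whether encoding should continue in that situation.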

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU 

--
 dfdl-wg mailing list
 dfdl-wg at ogf.org 
 http://www.ogf.org//mailman/listinfo/dfdl-wg 

-- 
Mike Beckerle | OGF DFDL WG Co-Chair 
Tel:  781-330-0412
