[DFDL-WG] Action 283: Provision for fallback mappings

Steve Hanson smh at uk.ibm.com
Thu Aug 27 04:32:32 EDT 2015


It's obviously less disruptive to the DFDL spec to add extra enums to 
dfdl:encodingErrorPolicy.  My concern in doing that is the orthogonality 
of substitution characters (an error has occurred) and fallbacks (defined 
mappings for a purpose). So let's look at the scenarios we need to support 
and see if that can generate a set of reasonably natural enums:

1) Error unmappable characters; fallbacks not required => "error"
2) Replace unmappable characters; fallbacks not required => "replace"
3) Error unmappable characters; fallbacks required => "fallback"
4) Replace unmappable characters; fallbacks required => 
"fallbackOrReplace"
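
To make the four scenarios concrete, here is a minimal Python sketch of how 
the proposed enum values could decompose into two independent switches. The 
names EncodingErrorPolicy, Behaviour, use_fallbacks, replace_unmappable and 
behaviour_for are hypothetical; they come from neither the DFDL spec nor any 
implementation.

```python
from enum import Enum
from typing import NamedTuple


class Behaviour(NamedTuple):
    use_fallbacks: bool       # apply the encoding's "fallback" mappings
    replace_unmappable: bool  # substitute rather than raise an error


class EncodingErrorPolicy(Enum):
    ERROR = "error"
    REPLACE = "replace"
    FALLBACK = "fallback"
    FALLBACK_OR_REPLACE = "fallbackOrReplace"


# One row per scenario 1) to 4) in the list above.
BEHAVIOUR = {
    EncodingErrorPolicy.ERROR: Behaviour(False, False),
    EncodingErrorPolicy.REPLACE: Behaviour(False, True),
    EncodingErrorPolicy.FALLBACK: Behaviour(True, False),
    EncodingErrorPolicy.FALLBACK_OR_REPLACE: Behaviour(True, True),
}


def behaviour_for(value: str) -> Behaviour:
    """Look up the two switches for a property value such as "fallback"."""
    return BEHAVIOUR[EncodingErrorPolicy(value)]
```

Seen this way, a single property with four values carries the same 
information as two yes/no properties, which is the orthogonality concern 
in a nutshell.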

I think both new enums are needed, because one IBM product that uses IBM 
DFDL said it wanted fallback but not substitution.

Regards
 
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848



From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     Steve Hanson/UK/IBM at IBMGB
Cc:     DFDL-WG <dfdl-wg at ogf.org>
Date:   26/08/2015 14:32
Subject:        Re: [DFDL-WG] Action 283: Provision for fallback mappings



Or... perhaps dfdl:encodingErrorPolicy="replaceOrFallback", that is, 
perhaps we can just add another enum value to reflect this policy rather 
than adding more properties.

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are 
subject to the OGF Intellectual Property Policy


On Tue, Aug 25, 2015 at 10:56 AM, Mike Beckerle <mbeckerle.dfdl at gmail.com> 
wrote:
Would an IBM-specific property work, to be proposed for future inclusion 
in DFDL? E.g., something like 

ibmdfdl:encodingErrorFallbackPolicy="never" or "fallback", with other 
enums reserved for the future.

I would like to pave a path for these sorts of proposed features. It would 
be good to see if this alone is sufficient to meet the needs of your 
customers who are asking for this, or whether they will need even a bit 
more control than this. 

It looks like we just missed some unparse behavior in 
dfdl:encodingErrorPolicy="replace": clearly, when a Unicode character has 
no mapping and the target encoding is SBCS and ASCII-derived, the 0x1A 
character is the right substitution. 

However, I know what will happen in Daffodil is what the standard ICU 
library does, with its default mapping definitions, and I don't know that 
this 0x1A substitution character is properly used in those mappings.






On Tue, Aug 25, 2015 at 9:29 AM, Steve Hanson <smh at uk.ibm.com> wrote:
Today the DFDL 1.0 spec has property dfdl:encodingErrorPolicy to control 
what happens when an unmappable or malformed character is encountered: 
'error' or 'replace'. With 'replace', the appropriate substitution 
character is used. 

There is also the orthogonal question of fallback mappings, which are 
mappings specified by an encoding that are not normal round-trip 
mappings.  DFDL does not currently provide for switching on fallback 
mappings. Here's what ICU says about this at 
http://userguide.icu-project.org/conversion/data. 

In the CHARMAP section of a .ucm file, each line contains a Unicode code 
point (like <U(1-6 hexadecimal digits for the code point)> ), a codepage 
character byte sequence (each byte like \xhh (2 hexadecimal digits) ), and 
an optional "precision" or "fallback" indicator. 
The precision indicator either must be present in all mappings or in none 
of them. The indicator is a pipe symbol ‘|’ followed by a 0, 1, 2, 3, or 4 
that has the following meaning: 
|0 - A "normal", roundtrip mapping from a Unicode code point and back. 
|1 - A "fallback" mapping only from Unicode to the codepage, but not back. 
|2 - A subchar1 mapping. The code point is unmappable, and if a 
substitution is performed, then the subchar1 should be used rather than 
the subchar. Otherwise, such mappings are ignored. 
|3 - A "reverse fallback" mapping only from the codepage to Unicode, but 
not back to the codepage. 
|4 - A "good one-way" mapping only from Unicode to the codepage, but not 
back.
Fallback mappings from Unicode typically do not map codes for the same 
character, but for "similar" ones. This mapping is sometimes done if a 
character exists in Unicode but not in the codepage. To replace it, ICU 
maps a codepage code to a similar-looking code for human-readable output. 
This mapping feature is not useful for text data transmission especially 
in markup languages where a Unicode code point can be escaped with its 
code point value. The ICU application programming interface (API) 
ucnv_setFallback() controls this fallback behavior. 
"Reverse fallbacks" are technically similar, but the same Unicode 
character can be encoded twice in the codepage. ICU always uses reverse 
fallbacks at runtime. 
A subset of the fallback mappings from Unicode is always used at runtime: 
Those that map private-use Unicode code points. Fallbacks from private-use 
code points are often introduced as replacements for previous roundtrip 
mappings for the same pair of codes. These replacements are used when a 
Unicode version assigns a new character that was previously mapped to that 
private-use code point. The mapping table is then changed to map the same 
codepage byte sequence to the new Unicode code point (as a new roundtrip) 
and the mapping from the old private-use code point to the same codepage 
code is preserved as a fallback. 
A "good one-way" mapping is like a fallback, but ICU always uses "good 
one-way" mappings at runtime, regardless of the fallback API flag. 
The idea is that fallbacks normally lose information, such as mapping from 
a compatibility variant of a letter to the ASCII version; however, 
fallbacks from PUA and reverse fallbacks are assumed to be for "the same 
character", just an older code for it.
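
For anyone who wants to inspect these indicators in a real .ucm file, here 
is a minimal Python sketch that parses a single CHARMAP line in the layout 
described above. The names Mapping and parse_charmap_line are hypothetical, 
the sample lines in the usage note are made up for illustration, and the 
pattern is deliberately simple: it does not handle trailing comments or any 
other .ucm syntax.

```python
import re
from typing import NamedTuple, Optional

# <Uhhhh>  \xhh[\xhh...]  [|n]   as described in the quoted ICU text
CHARMAP_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{1,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)(?:\s+\|([0-4]))?\s*$"
)


class Mapping(NamedTuple):
    code_point: int           # Unicode code point
    code_bytes: bytes         # codepage byte sequence
    precision: Optional[int]  # 0..4, or None if no indicator is present


def parse_charmap_line(line: str) -> Mapping:
    """Parse one CHARMAP line; raise ValueError if it does not match."""
    m = CHARMAP_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a CHARMAP mapping line: {line!r}")
    code_point = int(m.group(1), 16)
    code_bytes = bytes.fromhex(m.group(2).replace("\\x", ""))
    precision = int(m.group(3)) if m.group(3) is not None else None
    return Mapping(code_point, code_bytes, precision)
```

For example, parse_charmap_line(r"<UFF21> \x41 |1") yields code point 
0xFF21, bytes b"A" and precision 1, i.e. a "fallback" mapping.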

So the default behaviour for ICU is to use "good one-way" mappings, 
"reverse fallback" mappings, and "fallback" mappings from private-use-area 
code points, but only to use normal "fallback" mappings if the 
ucnv_setFallback() API has been used to switch them on. 
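
The rules just summarised can be written down as a small decision function. 
This is a Python sketch of my reading of the quoted ICU text, not ICU's 
actual code; the function names are hypothetical, and the private-use 
ranges are the standard Unicode ones, which I am assuming is what ICU means 
by "private-use".

```python
def is_private_use(code_point: int) -> bool:
    """True for the BMP private-use area and the two private-use planes."""
    return (
        0xE000 <= code_point <= 0xF8FF
        or 0xF0000 <= code_point <= 0xFFFFD
        or 0x100000 <= code_point <= 0x10FFFD
    )


def used_from_unicode(precision: int, code_point: int,
                      fallbacks_enabled: bool) -> bool:
    """Is a mapping with this precision applied when encoding (unparsing)?"""
    if precision in (0, 4):   # roundtrip and "good one-way": always used
        return True
    if precision == 1:        # fallback: only if enabled, or from private use
        return fallbacks_enabled or is_private_use(code_point)
    return False              # |2 marks unmappable; |3 is codepage-to-Unicode only


def used_to_unicode(precision: int) -> bool:
    """Is a mapping with this precision applied when decoding (parsing)?"""
    return precision in (0, 3)  # roundtrip and reverse fallback: always used
```

With fallbacks_enabled left at False, a "|1" mapping from an ordinary code 
point is skipped, which is exactly the case the customers are running into.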

IBM customers have requested the ability to use normal "fallback" 
mappings. At the current time, the only solution open to them is to change 
the .ucm file (or create a variant) and change the "|1" mappings to "|4" 
so that "fallback" mappings become "good one-way" mappings. 
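
As an illustration of that workaround, here is a minimal Python sketch that 
rewrites "|1" to "|4" on mapping lines between the CHARMAP and END CHARMAP 
markers of a .ucm file. The function name promote_fallbacks is hypothetical, 
and the pattern assumes the simple line layout quoted above plus an optional 
trailing "#" comment; a real .ucm file may need more care.

```python
import re

# A mapping line ending in the "|1" (fallback) indicator.
FALLBACK_LINE = re.compile(
    r"^(<U[0-9A-Fa-f]{1,6}>\s+(?:\\x[0-9A-Fa-f]{2})+\s+)\|1(\s*(?:#.*)?)$"
)


def promote_fallbacks(ucm_text: str) -> str:
    """Return ucm_text with every |1 mapping in CHARMAP changed to |4."""
    out = []
    in_charmap = False
    for line in ucm_text.splitlines():
        stripped = line.strip()
        if stripped == "CHARMAP":
            in_charmap = True
        elif stripped == "END CHARMAP":
            in_charmap = False
        elif in_charmap:
            # Only the indicator changes; code point and bytes are kept.
            line = FALLBACK_LINE.sub(r"\1|4\2", line)
        out.append(line)
    return "\n".join(out)
```

Note that this changes behaviour for every user of the modified converter, 
which is why a per-schema switch would be preferable.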

A proposal to support fallbacks was submitted a few years ago by Mike: 
https://www.ogf.org/pipermail/dfdl-wg/2011-November/001631.html. It 
proposed adding new DFDL annotations to allow replacement characters and 
fallback mappings to be specified.  This was rejected as ICU already 
provides this via the .ucm file. But no simpler alternative materialised, 
and the resulting erratum only added dfdl:encodingErrorPolicy, which does 
not handle fallbacks. 
  
Given a) the precedent of existing IBM DFDL and Daffodil behaviour which 
(should) match the ICU default, b) the orthogonality of substitution 
characters (an error has occurred) and fallbacks (defined mappings for a 
purpose), and c) an IBM recommendation not to switch on fallbacks by 
default, it feels like we need a new property, e.g. 
dfdl:useEncodingFallbacks 'yes' | 'no'.  Alternatives welcome. The names 
dfdl:encodingFallbackPolicy or dfdl:encodingPrecisionPolicy are better, 
but then comes the problem of finding meaningful enum values... 

Also noted: The wording for dfdl:encodingErrorPolicy 'replace' says: If 
'replace' then any error when decoding characters results in the insertion 
of the Unicode Replacement Character (U+FFFD) as the replacement for that 
error. That is not strictly true, as the same ICU page says: 
Conversion from a codepage to Unicode occurs and an unassigned codepoint 
is found 
1. If the input sequence is of length 1 and a subchar1 byte is 
   specified for the codepage [in the .ucm file], output U+001A 
2. Otherwise output U+FFFD
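
Expressed as code, the two-step rule is as follows. This is a Python sketch 
of the quoted rule only; the function name and the has_subchar1 parameter 
are hypothetical stand-ins for whatever the converter knows about the 
codepage.

```python
def decode_substitution(unassigned_input: bytes, has_subchar1: bool) -> str:
    """Replacement character for an unassigned input sequence when decoding."""
    if len(unassigned_input) == 1 and has_subchar1:
        return "\u001a"  # SUB, chosen when the codepage defines subchar1
    return "\ufffd"      # Unicode Replacement Character otherwise
```

So for a single-byte unassigned input in a codepage that defines subchar1, 
the parser would see U+001A rather than the U+FFFD the spec wording 
promises.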

There is then the question of how the two properties interact. 
Specifically, if fallbacks are not being used, does encountering a code 
point with a fallback result in dfdl:encodingErrorPolicy coming into play?  
I suspect so, but this needs verifying. 

Regards
 
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
