[DFDL-WG] Action 283: Provision for fallback mappings

Mike Beckerle mbeckerle.dfdl at gmail.com
Tue Aug 25 11:01:58 EDT 2015


Or... perhaps dfdl:encodingErrorPolicy="replaceOrFallback", that is,
perhaps we can just add another enum value to reflect this policy rather
than adding more properties.

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>


On Tue, Aug 25, 2015 at 10:56 AM, Mike Beckerle <mbeckerle.dfdl at gmail.com>
wrote:

> Would an IBM-specific property work, to be proposed for future inclusion in
> DFDL? E.g., something like
>
> ibmdfdl:encodingErrorFallbackPolicy="never" or "fallback" with other enums
> reserved for the future.
>
> I would like to pave a path for these sorts of proposed features. It would
> be good to see if this alone is sufficient to meet your customer's needs
> who are asking for this, or whether they will need even a bit more control
> than this.
>
> It looks like we just missed some unparse behavior in
> dfdl:encodingErrorPolicy="replace": clearly, when a Unicode character has
> no mapping and the target encoding is SBCS and ASCII-derived, the
> 0x1A character is the right substitution.
>
> However, I know what will happen in Daffodil is what the standard ICU
> library does, with its default mapping definitions, and I don't know that
> this 0x1A substitution character is properly used in those mappings.
>
>
> On Tue, Aug 25, 2015 at 9:29 AM, Steve Hanson <smh at uk.ibm.com> wrote:
>
>> Today the DFDL 1.0 spec has property dfdl:encodingErrorPolicy to control
>> what happens when an unmappable or malformed character is encountered -
>> 'error' or 'replace'. With 'replace', the appropriate substitution character
>> is used.
>>
>> There is also the orthogonal question of fallback mappings, which are
>> mappings specified by an encoding that are not normal round-trip
>> mappings.  DFDL does not currently provide for switching on fallback
>> mappings. Here's what ICU says about this at
>> http://userguide.icu-project.org/conversion/data.
>>
>> *In the CHARMAP section of a .ucm file, each line contains a Unicode code
>> point (like <U(1-6 hexadecimal digits for the code point)> ), a codepage
>> character byte sequence (each byte like \xhh (2 hexadecimal digits) ), and
>> an optional "precision" or "fallback" indicator.*
>>
>> *The precision indicator either must be present in all mappings or in
>> none of them. The indicator is a pipe symbol ‘|’ followed by a 0, 1, 2, 3,
>> or 4 that has the following meaning:*
>>
>>    - *|0 - A "normal", roundtrip mapping from a Unicode code point and
>>    back.*
>>    - *|1 - A "fallback" mapping only from Unicode to the codepage, but
>>    not back.*
>>    - *|2 - A subchar1 mapping. The code point is unmappable, and if a
>>    substitution is performed, then the subchar1 should be used rather than the
>>    subchar. Otherwise, such mappings are ignored.*
>>    - *|3 - A "reverse fallback" mapping only from the codepage to
>>    Unicode, but not back to the codepage.*
>>    - *|4 - A "good one-way" mapping only from Unicode to the codepage,
>>    but not back.*
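
A minimal sketch of reading one CHARMAP line in the format quoted above, assuming only the simple "<Uhhhh> \xhh... |n" shape described there. The names parse_charmap_line and PRECISION_NAMES are made up for illustration and are not part of ICU; real .ucm files may contain constructs this pattern does not cover.

```python
import re

# Pattern for the line shape described in the quoted ICU text:
# a Unicode code point, one or more \xhh bytes, and an optional |n indicator.
CHARMAP_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{1,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)(?:\s+\|([0-4]))?\s*$"
)

# Labels taken from the quoted list of precision indicators.
PRECISION_NAMES = {
    0: "roundtrip",
    1: "fallback",
    2: "subchar1",
    3: "reverse fallback",
    4: "good one-way",
}


def parse_charmap_line(line):
    """Return (code_point, byte_sequence, precision), or None if no match.

    precision is None when the line carries no |n indicator.
    """
    m = CHARMAP_LINE.match(line.strip())
    if m is None:
        return None
    code_point = int(m.group(1), 16)
    hex_bytes = re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2))
    byte_seq = bytes(int(h, 16) for h in hex_bytes)
    precision = int(m.group(3)) if m.group(3) is not None else None
    return code_point, byte_seq, precision
```

For example, parse_charmap_line(r"<UFF21> \x41 |1") yields the code point 0xFF21, the single byte 0x41, and precision 1, which PRECISION_NAMES labels as a fallback.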
>>
>> *Fallback mappings from Unicode typically do not map codes for the same
>> character, but for "similar" ones. This mapping is sometimes done if a
>> character exists in Unicode but not in the codepage. To replace it, ICU
>> maps a codepage code to a similar-looking code for human-readable output.
>> This mapping feature is not useful for text data transmission especially in
>> markup languages where a Unicode code point can be escaped with its code
>> point value. The ICU application programming interface (API)
>> ucnv_setFallback() controls this fallback behavior.*
>>
>> *"Reverse fallbacks" are technically similar, but the same Unicode
>> character can be encoded twice in the codepage. ICU always uses reverse
>> fallbacks at runtime.*
>>
>> *A subset of the fallback mappings from Unicode is always used at
>> runtime: Those that map private-use Unicode code points. Fallbacks from
>> private-use code points are often introduced as replacements for previous
>> roundtrip mappings for the same pair of codes. These replacements are used
>> when a Unicode version assigns a new character that was previously mapped
>> to that private-use code point. The mapping table is then changed to map
>> the same codepage byte sequence to the new Unicode code point (as a new
>> roundtrip) and the mapping from the old private-use code point to the same
>> codepage code is preserved as a fallback.*
>>
>> *A "good one-way" mapping is like a fallback, but ICU always uses "good
>> one-way" mappings at runtime, regardless of the fallback API flag.*
>>
>> *The idea is that fallbacks normally lose information, such as mapping
>> from a compatibility variant of a letter to the ASCII version; however,
>> fallbacks from PUA and reverse fallbacks are assumed to be for "the same
>> character", just an older code for it.*
>>
>> So the default behaviour for ICU is to use "good one-way" mappings,
>> "reverse fallback" mappings, and "fallback" mappings from private-use-area
>> code points, but only to use normal "fallback" mappings if the setFallback
>> API has been used.
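
The default behaviour summarised above can be written down as a small decision function. This is only a sketch of the rules as quoted from the ICU page, not ICU's actual code. The function names are hypothetical, and the private-use test uses the Unicode private-use ranges, which is an assumption about what ICU treats as private use.

```python
def is_private_use(code_point):
    """True for the Unicode private-use ranges (BMP PUA and planes 15/16)."""
    return (
        0xE000 <= code_point <= 0xF8FF
        or 0xF0000 <= code_point <= 0xFFFFD
        or 0x100000 <= code_point <= 0x10FFFD
    )


def mapping_is_used(precision, code_point, to_codepage, fallback_enabled=False):
    """Decide whether a .ucm mapping with the given |n indicator applies.

    to_codepage is True for Unicode -> codepage, False for codepage -> Unicode.
    fallback_enabled models the flag set through the setFallback API.
    """
    if precision == 0:
        # Roundtrip mappings are used in both directions.
        return True
    if precision == 1:
        # Fallbacks need the flag, unless the code point is private-use.
        return to_codepage and (fallback_enabled or is_private_use(code_point))
    if precision == 3:
        # Reverse fallbacks are always used, codepage -> Unicode only.
        return not to_codepage
    if precision == 4:
        # Good one-way mappings are always used, Unicode -> codepage only.
        return to_codepage
    # |2 marks the code point as unmappable, so it is never a real mapping.
    return False
```

With the flag off, mapping_is_used(1, 0xFF21, True) is False, which is the case the customers are asking to change.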
>>
>> IBM customers have requested the ability to use normal "fallback"
>> mappings. At the current time, the only solution open to them is to change
>> the .ucm file (or create a variant) and change the "|1" mappings to "|4" so
>> that "fallback" mappings become "good one-way" mappings.
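
The workaround just described could be scripted roughly as follows. This is a sketch under the assumption that the mapping lines sit between CHARMAP and END CHARMAP and follow the simple shape quoted earlier; promote_fallbacks is a hypothetical helper, not an ICU tool, and the result would still need to be rebuilt into a converter and tested.

```python
import re

# A mapping line whose precision indicator is |1 (a normal fallback).
FALLBACK_LINE = re.compile(
    r"^(\s*<U[0-9A-Fa-f]{1,6}>\s+(?:\\x[0-9A-Fa-f]{2})+\s+)\|1(?![0-9])"
)


def promote_fallbacks(ucm_text):
    """Return ucm_text with every |1 mapping in the CHARMAP section made |4."""
    out = []
    in_charmap = False
    for line in ucm_text.splitlines():
        stripped = line.strip()
        if stripped == "CHARMAP":
            in_charmap = True
        elif stripped == "END CHARMAP":
            in_charmap = False
        elif in_charmap:
            # Only the indicator changes; code point and bytes are kept.
            line = FALLBACK_LINE.sub(r"\g<1>|4", line)
        out.append(line)
    return "\n".join(out) + "\n"
```

Lines outside the CHARMAP section, and mappings with any other indicator, pass through unchanged.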
>>
>> A proposal to support fallbacks was submitted a few years ago by Mike:
>> https://www.ogf.org/pipermail/dfdl-wg/2011-November/001631.html. It
>> proposed adding new DFDL annotations to allow replacement characters and
>> fallback mappings to be specified.  This was rejected as ICU already
>> provides this via the .ucm file. But no simpler alternative materialised,
>> and the resulting erratum only added dfdl:encodingErrorPolicy, which does
>> not handle fallbacks.
>>
>> Given a) the precedent of existing IBM DFDL and Daffodil behaviour which
>> (should) match the ICU default, b) the orthogonality of substitution
>> characters (an error has occurred) and fallbacks (defined mappings for a
>> purpose), and c) an IBM recommendation not to switch on fallbacks by
>> default, it feels like we need a new property, e.g. *dfdl:useEncodingFallbacks
>> 'yes' | 'no'*.  Alternatives welcome. The names
>> dfdl:encodingFallbackPolicy or dfdl:encodingPrecisionPolicy are better, but
>> then comes the problem of finding meaningful enum values...
>>
>> Also noted: The wording for dfdl:encodingErrorPolicy 'replace' says: *If
>> 'replace' then any error when decoding characters results in the insertion
>> of the Unicode Replacement Character (U+FFFD) as the replacement for that
>> error.* That is not strictly true, as the same ICU page says:
>>
>>    - *Conversion from a codepage to Unicode occurs and an unassigned
>>    codepoint is found*
>>       1. *If the input sequence is of length 1 and a subchar1 byte
>>          is specified for the codepage* [in the .ucm file]*, output U+001A*
>>       2. *Otherwise output U+FFFD*
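
The two-step rule quoted from the ICU page amounts to the following choice. This is a sketch of the rule as stated, with a hypothetical function name, and it says nothing about how ICU implements it internally.

```python
def substitution_for_unassigned(input_length, has_subchar1):
    """Pick the replacement character when decoding hits an unassigned code.

    input_length is the length in bytes of the offending input sequence;
    has_subchar1 says whether the codepage's .ucm file specifies a subchar1.
    """
    if input_length == 1 and has_subchar1:
        return "\u001A"  # SUBSTITUTE control character
    return "\uFFFD"      # Unicode Replacement Character
```

So a single unassigned byte in a codepage that defines subchar1 decodes to U+001A, not to the U+FFFD that the current wording of the spec promises.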
>>
>>
>> There is then the question of how the two properties interact.
>> Specifically, if fallbacks are not being used, does encountering a code
>> point with a fallback result in dfdl:encodingErrorPolicy coming into play?  I
>> suspect so, but this needs verifying.
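
One way to picture the suspected interaction when unparsing is the sketch below. It encodes the unverified assumption that a code point with only a fallback mapping is treated as unmappable when fallbacks are off, so that dfdl:encodingErrorPolicy applies. The function, the exception class, the dictionary-based tables, and the 0x1A default substitution byte are all illustrative assumptions, not behaviour defined by the DFDL spec, IBM DFDL, or Daffodil.

```python
class UnmappableCharacter(ValueError):
    """Raised when error_policy is 'error' and no mapping applies."""


def encode_char(ch, roundtrip, fallback, use_fallbacks, error_policy,
                sub_bytes=b"\x1a"):
    """Encode one character under the two properties discussed above.

    roundtrip and fallback are dicts from character to bytes, standing in
    for the |0 and |1 mappings of a codepage. use_fallbacks models the
    proposed dfdl:useEncodingFallbacks; error_policy models
    dfdl:encodingErrorPolicy ('error' or 'replace').
    """
    if ch in roundtrip:
        return roundtrip[ch]
    if use_fallbacks and ch in fallback:
        return fallback[ch]
    # Assumed: with fallbacks off, a fallback-only character is unmappable.
    if error_policy == "replace":
        return sub_bytes
    raise UnmappableCharacter("no mapping for U+%04X" % ord(ch))
```

Under this reading the two properties stay orthogonal: the fallback switch decides whether a mapping exists, and the error policy only decides what happens once none does.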
>>
>> Regards
>>
>> Steve Hanson
>> Architect, *IBM DFDL*
>> <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
>> Co-Chair, *OGF DFDL Working Group* <http://www.ogf.org/dfdl/>
>> IBM SWG, Hursley, UK
>> *smh at uk.ibm.com* <smh at uk.ibm.com>
>> tel:+44-1962-815848
>> Unless stated otherwise above:
>> IBM United Kingdom Limited - Registered in England and Wales with number
>> 741598.
>> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU