[DFDL-WG] ICU and dfdl:textNumberPattern issue

Steve Hanson smhdfdl at gmail.com
Thu Mar 14 11:50:12 PDT 2024


Clever!

On Thu, Mar 14, 2024 at 5:38 PM Mike Beckerle <mbeckerle.dfdl at gmail.com>
wrote:

> Reporting back to the DFDL WG on how this issue got resolved on the
> Daffodil project.
>
> The issue came up in the NITF data format. The schema is on the
> DFDLSchemas GitHub site.
>
> Here is the long issue: https://github.com/DFDLSchemas/NITF/issues/19
> read intro then skip to this comment:
> https://github.com/DFDLSchemas/NITF/issues/19#issuecomment-1889110915
>
> Also related: https://issues.apache.org/jira/browse/DAFFODIL-2870 which
> is an open issue to warn about any ICU negative pattern chars that are
> ignored.
>
> Also related: https://github.com/apache/daffodil/pull/1139 which is a
> related fix.
>
> On Thu, Mar 14, 2024 at 11:55 AM Steve Hanson <smhdfdl at gmail.com> wrote:
>
>> Mike, this is the thread about the negative pattern length issue.
>>
>> On Mon, Jan 15, 2024 at 11:10 AM Steve Hanson <smhdfdl at gmail.com> wrote:
>>
>>> I seem to recall hitting the same problem with X12 fixed length numerics
>>> which could be positive or negative. If I get time before our next call
>>> I'll re-familiarise myself with the X12 issue and what I did to work round
>>> it.
>>>
>>> Regards
>>> Steve
>>>
>>> On Thu, Jan 11, 2024 at 6:25 PM Mike Beckerle <mbeckerle at apache.org>
>>> wrote:
>>>
>>>> We've run into an issue with dfdl:textNumberPattern, which is an ICU
>>>> number pattern. I'll describe it here and then suggest that a fix is
>>>> needed in DFDL generally, though we should discuss that hypothesis.
>>>>
>>>> The motivating example is fixed length 5 character integer text data.
>>>> The data ranges from -9999 to 99999. Note that the minus sign uses up one
>>>> of the 5 characters that can be a digit for positive values.
>>>>
>>>> Consider the value -123 and textNumberPattern of "00000;-0". The value
>>>> unparses as -00123, which has length 6 and so is too long.
>>>>
>>>> The padding feature of ICU number patterns can be used to "fix" this.
>>>> Consider textNumberPattern="*0####0". The "*0" notation means to use 0 as
>>>> the padding character to replace the "#" when needed.
>>>> Now the value 123 unparses as 00123 but ... here's the problem.... -123
>>>> unparses as 0-123. Notice how the zero padding is before the minus sign
>>>> when we wanted it to appear after.
>>>>
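
To make the two behaviors above concrete, here is a small Python sketch. It
is only a model of the formatting described in this thread, not ICU or
Daffodil code; the function names and the pad_after_prefix flag are
hypothetical and exist only for illustration.

```python
def format_min_digits(value: int, min_digits: int) -> str:
    """Model of a pattern like "00000;-0": a minimum digit count, with the
    negative prefix added on top of those digits."""
    prefix = "-" if value < 0 else ""
    return prefix + str(abs(value)).zfill(min_digits)


def format_padded(value: int, width: int, pad: str = "0",
                  pad_after_prefix: bool = False) -> str:
    """Model of a width-padded pattern like "*0####0".

    pad_after_prefix=False models PAD_BEFORE_PREFIX (padding, then sign);
    pad_after_prefix=True models PAD_AFTER_PREFIX (sign, then padding).
    """
    prefix = "-" if value < 0 else ""
    digits = str(abs(value))
    padding = pad * max(0, width - len(prefix) - len(digits))
    if pad_after_prefix:
        return prefix + padding + digits
    return padding + prefix + digits


print(format_min_digits(-123, 5))                    # -00123 (6 chars)
print(format_padded(123, 5))                         # 00123
print(format_padded(-123, 5))                        # 0-123
print(format_padded(-123, 5, pad_after_prefix=True)) # -0123
```

The third line of output is the unwanted result; the fourth is what a fixed
length 5 character field covering -9999 to 99999 needs.
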
>>>> This problem is caused by ICU taking nearly all the information from
>>>> the positive part of the textNumberPattern. The negative part of the
>>>> pattern, if it exists, is used only to define the affix (prefix or suffix
>>>> or both) that indicates negative values.
>>>>
>>>> The problem is that positive numbers commonly have no affix, so the
>>>> position of padding characters relative to the affix cannot always be
>>>> determined from the positive pattern alone.
>>>>
>>>> Hence, if textNumberPattern specifies a pad character before the number
>>>> pattern and without a positive prefix, then ICU defaults to a pad position
>>>> of PAD_BEFORE_PREFIX with no way to change it with just the pattern.
>>>>
>>>> This behavior is reasonable for most cases, like when the pad character
>>>> is a space. However, if the pad character in textNumberPattern is '0', then
>>>> negative numbers are padded with a '0' before the negative sign. So we get
>>>> the errant behavior where a pattern of "*0####0" unparses -123 to "0-123".
>>>> This is very unlikely to be what the user wants with this pattern.
>>>>
>>>> Now suppose the positive pattern required a prefix "+" sign. The
>>>> textNumberPattern of "+*0####0" works properly because ICU determines that
>>>> the padding is PAD_AFTER_PREFIX from the positive pattern where the "*0" is
>>>> after the "+" prefix.
>>>>
>>>> The proposed fix to this issue that we're implementing in Daffodil is
>>>> this: If both negative and positive patterns define padding on the
>>>> same affix, and the positive pattern has an empty string for that affix,
>>>> then we use the pad position from the negative pattern. In all other cases,
>>>> the pad character in the negative pattern is ignored following usual ICU
>>>> behavior.
>>>>
>>>> For example, a textNumberPattern of "*0####0;-*00" formats a negative
>>>> number with zero padding after the hyphen, whereas normal ICU behavior
>>>> would ignore the negative pattern and zero pad before the hyphen.
>>>>
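
The selection rule described above can be sketched in Python as follows.
This is a simplified model under stated assumptions, not Daffodil's actual
implementation: it only looks at prefix-side padding, it does not handle
quoted characters in patterns, and parse_subpattern and
effective_pad_position are hypothetical names.

```python
NUMBER_CHARS = "#0,."


def parse_subpattern(sub: str):
    """Return (prefix, pad_position) for one subpattern.

    pad_position is "before_prefix", "after_prefix", or None when no
    prefix-side pad specifier ("*" followed by the pad character) is found.
    """
    prefix = ""
    pad_position = None
    i = 0
    # Scan everything that precedes the numeric part of the subpattern.
    while i < len(sub) and sub[i] not in NUMBER_CHARS:
        if sub[i] == "*":
            # With an empty prefix the two positions are indistinguishable,
            # so default to before_prefix, as the thread says ICU does.
            pad_position = "after_prefix" if prefix else "before_prefix"
            i += 2  # skip "*" and the pad character that follows it
        else:
            prefix += sub[i]
            i += 1
    return prefix, pad_position


def effective_pad_position(pattern: str):
    """Apply the rule: take the pad position from the negative subpattern
    only when both subpatterns pad and the positive prefix is empty."""
    pos_sub, _, neg_sub = pattern.partition(";")
    pos_prefix, pos_pad = parse_subpattern(pos_sub)
    if neg_sub:
        _, neg_pad = parse_subpattern(neg_sub)
        if pos_pad is not None and neg_pad is not None and pos_prefix == "":
            return neg_pad
    return pos_pad


print(effective_pad_position("*0####0"))       # before_prefix
print(effective_pad_position("*0####0;-*00"))  # after_prefix
print(effective_pad_position("+*0####0"))      # after_prefix
```

In this model a pattern with an explicit positive prefix keeps the position
taken from the positive subpattern, and any pad specifier in the negative
subpattern is ignored.
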
>>>>
>>>> I would suggest this is something that is a needed fix in general for
>>>> DFDL.
>>>>
>>>>
>>>> The workarounds when you have the fixed length number use case (-9999
>>>> to 99999) are very ugly: either treat the minus sign as an initiator and
>>>> create separate elements for positive and negative values, or punt on
>>>> this integer and treat the whole thing as a string.
>>>>
>>>>
>>>> Arguably, this might be considered an ICU bug. However, ICU's API can be
>>>> used to specify the PAD_AFTER_PREFIX behavior, so ICU does let you achieve
>>>> the needed behavior; it just cannot be achieved using only the ICU
>>>> pattern string. ICU maintainers may or may not consider this to be a bug.
>>>>
>>>>
>>>> Mike Beckerle
>>>> Apache Daffodil PMC | daffodil.apache.org
>>>> OGF DFDL Workgroup Co-Chair |
>>>> www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
>>>> Owl Cyber Defense | www.owlcyberdefense.com
>>>>
>>>>
>>>> --
>>>>   dfdl-wg mailing list
>>>>   dfdl-wg at lists.ogf.org
>>>>   https://lists.ogf.org/mailman/listinfo/dfdl-wg
>>>>