[DFDL-WG] ICU and dfdl:textNumberPattern issue

Mike Beckerle mbeckerle.dfdl at gmail.com
Thu Mar 14 10:38:24 PDT 2024


Reporting back to the DFDL WG on how this issue got resolved on the
Daffodil project.

The issue came up in the NITF data format. The schema is on the GitHub
DFDLSchemas site.

Here is the long issue: https://github.com/DFDLSchemas/NITF/issues/19
Read the intro, then skip to this comment:
https://github.com/DFDLSchemas/NITF/issues/19#issuecomment-1889110915

Also related: https://issues.apache.org/jira/browse/DAFFODIL-2870 which is
an open issue to warn about any ICU negative pattern chars that are
ignored.

Also related: https://github.com/apache/daffodil/pull/1139 which is a
related fix.

On Thu, Mar 14, 2024 at 11:55 AM Steve Hanson <smhdfdl at gmail.com> wrote:

> Mike, this is the thread about the negative pattern length issue.
>
> On Mon, Jan 15, 2024 at 11:10 AM Steve Hanson <smhdfdl at gmail.com> wrote:
>
>> I seem to recall hitting the same problem with X12 fixed length numerics
>> which could be positive or negative. If I get time before our next call
>> I'll re-familiarise myself with the X12 issue and what I did to work round
>> it.
>>
>> Regards
>> Steve
>>
>> On Thu, Jan 11, 2024 at 6:25 PM Mike Beckerle <mbeckerle at apache.org>
>> wrote:
>>
>>> We've run into an issue with dfdl:textNumberPattern, which is an ICU
>>> number pattern. I'll discuss here, and then suggest this is a fix needed in
>>> DFDL generally, but we should discuss that hypothesis.
>>>
>>> The motivating example is fixed length 5 character integer text data.
>>> The data ranges from -9999 to 99999. Note that the minus sign uses up one
>>> of the 5 characters that can be a digit for positive values.
>>>
>>> Consider the value -123 and textNumberPattern of "00000;-0". The value
>>> unparses as -00123, which has length 6 and so is too long.
>>>
>>> The padding feature of ICU number patterns can be used to "fix" this.
>>> Consider textNumberPattern="*0####0". The "*0" notation means to use 0 as
>>> the padding character to replace the "#" when needed.
>>> Now the value 123 unparses as 00123 but ... here's the problem.... -123
>>> unparses as 0-123. Notice how the zero padding is before the minus sign
>>> when we wanted it to appear after.
>>>
>>> This problem is caused by ICU taking nearly all the information from the
>>> positive part of the textNumberPattern. The negative part of the pattern,
>>> if it exists, is used only to define the affixes (prefix, suffix, or
>>> both) that indicate negative values.
>>>
>>> The problem is that positive numbers commonly have no affix, so the
>>> position of padding characters relative to the affix cannot always be
>>> determined from the positive pattern alone.
>>>
>>> Hence, if textNumberPattern specifies a pad character before the number
>>> pattern and without a positive prefix, then ICU defaults to a pad position
>>> of PAD_BEFORE_PREFIX with no way to change it with just the pattern.
>>>
>>> This behavior is reasonable for most cases, like when the pad character
>>> is a space. However, if the pad character in textNumberPattern is '0', then
>>> negative numbers are padded with a '0' before the negative sign. So we get
>>> the errant behavior where a pattern of "*0####0" unparses -123 to "0-123".
>>> This is very unlikely to be what the user wants with this pattern.
>>>
>>> Now suppose the positive pattern required a prefix "+" sign. The
>>> textNumberPattern of "+*0####0" works properly because ICU determines that
>>> the padding is PAD_AFTER_PREFIX from the positive pattern where the "*0" is
>>> after the "+" prefix.
>>>
>>> The proposed fix to this issue that we're implementing in Daffodil is
>>> this: If both negative and positive patterns define padding on the same
>>> affix, and the positive pattern has an empty string for that affix, then we
>>> use the pad position from the negative pattern. In all other cases, the pad
>>> character in the negative pattern is ignored following usual ICU behavior.
>>>
>>> For example, a textNumberPattern of "*0####0;-*00" formats a negative
>>> number with zero padding after the hyphen, whereas normal ICU behavior
>>> would ignore the negative pattern and zero pad before the hyphen.
>>>
>>>
>>> I would suggest that this fix is needed in DFDL generally.
>>>
>>>
>>> The workarounds for the fixed length number use case (-9999 to 99999)
>>> are very ugly: either treat the minus sign as an initiator and create
>>> separate elements for positive and negative values, or punt on this
>>> integer and treat the whole thing as a string.
>>>
>>>
>>> Arguably, this might be considered an ICU bug, but ICU's API can be used
>>> to specify the PAD_AFTER_PREFIX behavior, so it's not like ICU doesn't let
>>> you achieve the needed behavior, it's just not something that can be
>>> achieved using only the ICU pattern string. But ICU maintainers may or may
>>> not consider this to be a bug.
>>>
>>>
>>> Mike Beckerle
>>> Apache Daffodil PMC | daffodil.apache.org
>>> OGF DFDL Workgroup Co-Chair |
>>> www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
>>> Owl Cyber Defense | www.owlcyberdefense.com
>>>
>>>
>>> --
>>>   dfdl-wg mailing list
>>>   dfdl-wg at lists.ogf.org
>>>   https://lists.ogf.org/mailman/listinfo/dfdl-wg
>>>
>>
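
For readers who want to experiment with the pad-position rule described in
the quoted message, here is a minimal sketch in Python. It does not call ICU
or Daffodil. The function names (parse_prefix_side, effective_pad_position,
format_fixed, unparse) are hypothetical and exist only for illustration. The
sketch assumes prefix-side padding only, no quoted characters in the pattern,
a negative prefix of "-", and a field width passed in explicitly rather than
derived from the pattern.

```python
def parse_prefix_side(subpattern):
    """Return (prefix, pad_char, pad_position) for one subpattern.

    Simplified on purpose: handles only a "*c" pad spec on the prefix
    side and stops at the first digit placeholder. Not the ICU parser.
    """
    prefix = ""
    pad_char = None
    pad_position = None
    i = 0
    while i < len(subpattern):
        ch = subpattern[i]
        if ch == "*" and i + 1 < len(subpattern):
            pad_char = subpattern[i + 1]
            # With no prefix seen yet, the position defaults to "before".
            pad_position = "after_prefix" if prefix else "before_prefix"
            i += 2
        elif ch in "#0,.":
            break  # start of the numeric part of the pattern
        else:
            prefix += ch
            i += 1
    return prefix, pad_char, pad_position


def effective_pad_position(pattern):
    """Apply the rule from the message: if both subpatterns define prefix-side
    padding and the positive prefix is empty, take the pad position from the
    negative subpattern. Otherwise the positive subpattern decides."""
    pos_sub, _, neg_sub = pattern.partition(";")
    pos_prefix, pos_pad, pos_where = parse_prefix_side(pos_sub)
    if pos_pad is None:
        return None  # no padding requested at all
    if neg_sub:
        _, neg_pad, neg_where = parse_prefix_side(neg_sub)
        if neg_pad is not None and pos_prefix == "":
            return neg_where
    return pos_where


def format_fixed(value, width, pad_after_prefix):
    """Zero-pad an integer to `width`, with the pad before or after the '-'."""
    prefix = "-" if value < 0 else ""
    digits = str(abs(value))
    pad = "0" * max(0, width - len(prefix) - len(digits))
    if pad_after_prefix:
        return prefix + pad + digits
    return pad + prefix + digits


def unparse(value, pattern, width=5):
    """Format `value` under a padded pattern, using the rule above."""
    where = effective_pad_position(pattern)
    if where is None:
        raise ValueError("this sketch only models patterns with a pad spec")
    return format_fixed(value, width, where == "after_prefix")
```

With the patterns from the message, unparse(-123, "*0####0") gives "0-123"
(the padding lands before the minus sign), while
unparse(-123, "*0####0;-*00") gives "-0123". Positive values are unaffected:
unparse(123, "*0####0") gives "00123".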