[DFDL-WG] ICU and dfdl:textNumberPattern issue

Mike Beckerle mbeckerle at apache.org
Thu Jan 11 10:25:22 PST 2024


We've run into an issue with dfdl:textNumberPattern, which is an ICU number
pattern. I'll describe it here and then suggest that the fix is needed in
DFDL generally, though we should discuss that hypothesis.

The motivating example is fixed-length, 5-character integer text data. The
data ranges from -9999 to 99999. Note that the minus sign uses up one of
the 5 characters that can be a digit for positive values.

Consider the value -123 and a textNumberPattern of "00000;-0". The value
unparses as -00123, which is 6 characters long and so does not fit.
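
To make the length arithmetic concrete, here is a small Python sketch. It
does not call ICU. The helper format_min_digits is a hypothetical stand-in
that only models what the pattern "00000;-0" asks for, namely at least five
integer digits with a "-" prefix for negative values.

```python
def format_min_digits(value: int, min_digits: int = 5,
                      negative_prefix: str = "-") -> str:
    """Toy model of the pattern "00000;-0" (an assumption, not ICU itself).

    Zero-fill the magnitude to min_digits, then prepend the negative prefix.
    """
    digits = str(abs(value)).zfill(min_digits)
    return (negative_prefix if value < 0 else "") + digits


for v in (123, -123):
    text = format_min_digits(v)
    print(v, repr(text), len(text))
# 123 '00123' 5
# -123 '-00123' 6
```

The sign is added on top of the five mandatory digits, so every negative
value overflows the 5-character field.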

The padding feature of ICU number patterns can be used to "fix" this.
Consider textNumberPattern="*0####0". The "*0" notation means to use 0 as
the pad character, filling the positions where a "#" produces no digit.
Now the value 123 unparses as 00123, but here is the problem: -123
unparses as 0-123. Notice how the zero padding appears before the minus
sign when we wanted it to appear after.

This problem is caused by ICU taking nearly all the information from the
positive part of the textNumberPattern. The negative part of the pattern,
if it exists, is used only to define the affix (prefix or suffix or both)
that indicates negative values.

The problem is that positive numbers commonly have no affix, so the
position of padding characters relative to the affix cannot always be
determined from the positive pattern alone.

Hence, if textNumberPattern specifies a pad character before the number
pattern and without a positive prefix, then ICU defaults to a pad position
of PAD_BEFORE_PREFIX with no way to change it with just the pattern.

This behavior is reasonable for most cases, like when the pad character is
a space. However, if the pad character in textNumberPattern is '0', then
negative numbers are padded with a '0' before the negative sign. So we get
the errant behavior where a pattern of "*0####0" unparses -123 to "0-123".
This is very unlikely to be what the user wants with this pattern.
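
The difference between the two pad positions can be sketched with a toy
model in Python. This is not ICU code: pad_number is a hypothetical helper,
the two constants only borrow ICU's names, and the width of 5 is an
assumption taken from the fixed-length example above.

```python
# Names borrowed from ICU's pad positions; the values here are arbitrary.
PAD_BEFORE_PREFIX = "before_prefix"
PAD_AFTER_PREFIX = "after_prefix"


def pad_number(value: int, width: int = 5, pad: str = "0",
               position: str = PAD_BEFORE_PREFIX) -> str:
    """Toy model: pad prefix + digits out to width at the given position."""
    prefix = "-" if value < 0 else ""
    digits = str(abs(value))
    fill = pad * max(0, width - len(prefix) - len(digits))
    if position == PAD_AFTER_PREFIX:
        return prefix + fill + digits
    return fill + prefix + digits


for v in (123, -123, -9999, 99999):
    print(v,
          pad_number(v, position=PAD_BEFORE_PREFIX),
          pad_number(v, position=PAD_AFTER_PREFIX))
# 123 00123 00123
# -123 0-123 -0123
# -9999 -9999 -9999
# 99999 99999 99999
```

Only values that are both negative and shorter than the field differ between
the two positions, which is why the problem is easy to miss with positive
test data.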

Now suppose the positive pattern required a prefix "+" sign. The
textNumberPattern of "+*0####0" works properly because ICU determines that
the padding is PAD_AFTER_PREFIX from the positive pattern where the "*0" is
after the "+" prefix.

The proposed fix to this issue that we're implementing in Daffodil is this: If
both negative and positive patterns define padding on the same affix, and
the positive pattern has an empty string for that affix, then we use the
pad position from the negative pattern. In all other cases, the pad
character in the negative pattern is ignored following usual ICU behavior.

For example, a textNumberPattern of "*0####0;-*00" formats a negative
number with zero padding after the hyphen, whereas normal ICU behavior
would ignore the negative pattern and zero pad before the hyphen.
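
One possible reading of this rule, written as a Python sketch. This is not
Daffodil's implementation and it does not parse patterns. The function
choose_pad_position, the (side, affix) tuples, and the affix dictionary are
all hypothetical; the caller is assumed to have already extracted them from
the positive and negative parts of the pattern.

```python
from typing import Dict, Optional, Tuple

# (side, affix), for example ("before", "prefix") or ("after", "suffix").
PadPosition = Tuple[str, str]


def choose_pad_position(positive_pad: Optional[PadPosition],
                        negative_pad: Optional[PadPosition],
                        positive_affixes: Dict[str, str]) -> Optional[PadPosition]:
    """Pick the pad position under the proposed rule (illustrative only)."""
    if positive_pad is None or negative_pad is None:
        # No padding at all, or nothing in the negative pattern to consider.
        return positive_pad
    same_affix = positive_pad[1] == negative_pad[1]
    if same_affix and positive_affixes[positive_pad[1]] == "":
        # The positive affix is empty, so the positive pattern cannot say
        # which side of the affix the padding belongs on.
        return negative_pad
    # All other cases: usual ICU behavior, negative padding is ignored.
    return positive_pad


# "*0####0;-*00": empty positive prefix, negative pattern pads after "-".
print(choose_pad_position(("before", "prefix"), ("after", "prefix"),
                          {"prefix": "", "suffix": ""}))
# ('after', 'prefix')
```

With a non-empty positive prefix, as in "+*0####0", the sketch returns the
positive pattern's position unchanged.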


I would suggest that this fix is needed for DFDL in general.


The workaround for the fixed-length number use case (-9999 to 99999) is
very ugly: either treat the minus sign as an initiator and create separate
elements for positive and negative values, or punt on the integer and
treat the whole thing as a string.


Arguably, this might be considered an ICU bug. However, ICU's API can be
used to specify the PAD_AFTER_PREFIX behavior, so ICU does let you achieve
the needed behavior; it just cannot be achieved using only the ICU pattern
string. ICU maintainers may or may not consider this to be a bug.


Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com