[DFDL-WG] ICU and dfdl:textNumberPattern issue

Mike Beckerle mbeckerle at apache.org
Fri Jan 12 06:03:09 PST 2024


Ok, after a week of agonizing over this problem, a simple textNumberPattern
has been found (not by me) that works for this 5-character, -9999 to 99999
problem.

dfdl:textNumberPattern="*0#0000"

This works because the pad character ICU uses is zero, but it is only
needed as a pad character for positive numbers. Negative numbers are always
formatted with at least 4 digits, so the minus sign takes the position of
the "#" and no padding is needed.
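
To see why this works, here is a rough Python model of padded formatting.
It is not a call into ICU. The function name format_padded and its
parameters are made up for illustration, and it assumes that "*0#0000"
amounts to a minimum of 4 digits in a 5-character field with the pad placed
before the prefix, which is how the behavior is described in this thread.

```python
def format_padded(value: int, min_digits: int, width: int,
                  pad_char: str = "0", pad_after_prefix: bool = False) -> str:
    """Simplified model of padded number formatting (NOT the ICU algorithm).

    min_digits       -- minimum integer digits (the count of '0' in the pattern)
    width            -- total field width (length of the pattern body)
    pad_after_prefix -- False models PAD_BEFORE_PREFIX, True PAD_AFTER_PREFIX
    """
    prefix = "-" if value < 0 else ""
    digits = str(abs(value)).zfill(min_digits)
    # Padding is only added when prefix + digits is shorter than the field.
    padding = pad_char * max(0, width - len(prefix) - len(digits))
    if pad_after_prefix:
        return prefix + padding + digits
    return padding + prefix + digits


# "*0#0000": at least 4 digits, width 5, pad before the prefix.
print(format_padded(123, 4, 5))     # 00123
print(format_padded(-123, 4, 5))    # -0123 (already 5 characters, no pad)
print(format_padded(-9999, 4, 5))   # -9999
print(format_padded(99999, 4, 5))   # 99999

# "*0####0" from the quoted message: at least 1 digit, width 5.
print(format_padded(-123, 1, 5))    # 0-123 (pad lands before the sign)
```

In this model the pad position never matters for "*0#0000", because the
only values that need a pad character are the positive ones, which have no
prefix.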

The suggested fix described in this thread is still needed for obscure
situations where the padding after the sign is a non-numeric character such
as a space, so for numbers like "+     1234.56", or even
"+ xxxx1234.56", where the 'x' padding needs to be removed when parsing the
number.

Follow-up question, though: has anyone ever seen *data* like that, with "x"
filling the unused leading digit positions? (I don't mean printed on a
check; we've all seen that. I mean in data.)
This representation is clearly possible. I just don't know of any
real-world example of it. I made it up based on having seen checks printed
that way.

I have definitely seen "+    1234.56" where the sign is first in the field
regardless of the length of the number. I've also seen "    1234.56+" where
the sign is trailing.
But I can't say I've seen "+ xxxx1234.56" in data.
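
For concreteness, the sign-first and sign-last layouts above could be
parsed along these lines. This is only a sketch in Python, not Daffodil or
ICU code. The function name parse_signed_padded and the pad_char parameter
are hypothetical, and it assumes the pad character is non-numeric and sits
between the sign and the digits.

```python
from decimal import Decimal


def parse_signed_padded(text: str, pad_char: str = " ") -> Decimal:
    """Parse a fixed-length field whose sign is the first or last character,
    with non-numeric padding before the digits.

    Assumption (not from any spec): pad_char is never a digit, so stripping
    it from the left cannot remove part of the number.
    """
    sign = ""
    body = text
    if body.startswith(("+", "-")):      # sign-first, e.g. "+    1234.56"
        sign, body = body[0], body[1:]
    elif body.endswith(("+", "-")):      # sign-last, e.g. "    1234.56+"
        sign, body = body[-1], body[:-1]
    # Drop surrounding spaces, then any leading pad characters.
    digits = body.strip().lstrip(pad_char)
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return Decimal(sign + digits)


print(parse_signed_padded("+    1234.56"))                 # 1234.56
print(parse_signed_padded("    1234.56-"))                 # -1234.56
print(parse_signed_padded("+ xxxx1234.56", pad_char="x"))  # 1234.56
```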

On Thu, Jan 11, 2024 at 1:25 PM Mike Beckerle <mbeckerle at apache.org> wrote:

> We've run into an issue with dfdl:textNumberPattern, which is an ICU
> number pattern. I'll discuss here, and then suggest this is a fix needed in
> DFDL generally, but we should discuss that hypothesis.
>
> The motivating example is fixed length 5 character integer text data. The
> data ranges from -9999 to 99999. Note that the minus sign uses up one of
> the 5 characters that can be a digit for positive values.
>
> Consider the value -123 and textNumberPattern of "00000;-0". The value
> unparses as -00123 which is length 6 so too long.
>
> The padding feature of ICU number patterns can be used to "fix" this.
> Consider textNumberPattern="*0####0". The "*0" notation means to use 0 as
> the padding character to replace the "#" when needed.
> Now the value 123 unparses as 00123 but ... here's the problem.... -123
> unparses as 0-123. Notice how the zero padding is before the minus sign
> when we wanted it to appear after.
>
> This problem is caused by ICU taking nearly all the information from the
> positive part of the textNumberPattern. The negative part of the pattern,
> if it exists, is used only to define the affix (prefix or suffix or both)
> that indicates negative values.
>
> The problem is that positive numbers commonly have no affix, so the
> position of padding characters relative to the affix cannot always be
> determined from the positive pattern alone.
>
> Hence, if textNumberPattern specifies a pad character before the number
> pattern and without a positive prefix, then ICU defaults to a pad position
> of PAD_BEFORE_PREFIX with no way to change it with just the pattern.
>
> This behavior is reasonable for most cases, like when the pad character is
> a space. However, if the pad character in textNumberPattern is '0', then
> negative numbers are padded with a '0' before the negative sign. So we get
> the errant behavior where a pattern of "*0####0" unparses -123 to "0-123".
> This is very unlikely to be what the user wants with this pattern.
>
> Now suppose the positive pattern required a prefix "+" sign. The
> textNumberPattern of "+*0####0" works properly because ICU determines that
> the padding is PAD_AFTER_PREFIX from the positive pattern where the "*0" is
> after the "+" prefix.
>
> The proposed fix to this issue that we're implementing in Daffodil is
> this: If both negative and positive patterns define padding on the same
> affix, and the positive pattern has an empty string for that affix, then we
> use the pad position from the negative pattern. In all other cases, the pad
> character in the negative pattern is ignored following usual ICU behavior.
>
> For example, a textNumberPattern of "*0####0;-*00" formats a negative
> number with zero padding after the hyphen, whereas normal ICU behavior
> would ignore the negative pattern and zero pad before the hyphen.
>
>
> I would suggest that this fix is needed in DFDL generally.
>
>
> The workaround for the fixed-length number use case (-9999 to 99999) is
> very ugly: either treat the minus sign as an initiator and create separate
> elements for positive and negative values, or punt on this integer and
> treat the whole thing as a string.
>
>
> Arguably, this might be considered an ICU bug, but ICU's API can be used
> to specify the PAD_AFTER_PREFIX behavior, so it's not like ICU doesn't let
> you achieve the needed behavior, it's just not something that can be
> achieved using only the ICU pattern string. But ICU maintainers may or may
> not consider this to be a bug.
>
>
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> Owl Cyber Defense | www.owlcyberdefense.com
>
>
>
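
The pad-position rule proposed in the quoted message can be sketched as
follows. This is a simplified Python illustration, not Daffodil's
implementation and not ICU code. The helper names are hypothetical. It only
looks at padding on the prefix side and ignores quoting and suffix padding,
which real ICU patterns also allow.

```python
from typing import Optional, Tuple

DIGIT_CHARS = "#0"


def scan_prefix(subpattern: str) -> Tuple[str, Optional[str]]:
    """Return (prefix, pad position) for the part before the first digit.

    The pad position is None when the subpattern has no "*c" pad spec there.
    """
    prefix = ""
    pad_position = None
    i = 0
    while i < len(subpattern) and subpattern[i] not in DIGIT_CHARS:
        if subpattern[i] == "*":
            # "*c" seen before any prefix text means pad-before-prefix.
            pad_position = "PAD_AFTER_PREFIX" if prefix else "PAD_BEFORE_PREFIX"
            i += 2  # skip the '*' and the pad character that follows it
        else:
            prefix += subpattern[i]
            i += 1
    return prefix, pad_position


def effective_pad_position(pattern: str) -> Optional[str]:
    """Apply the proposed rule: take the pad position from the negative
    subpattern only when both subpatterns pad on the prefix and the
    positive prefix is empty. Otherwise the positive subpattern decides."""
    positive, _, negative = pattern.partition(";")
    pos_prefix, pos_pad = scan_prefix(positive)
    if negative:
        _, neg_pad = scan_prefix(negative)
        if pos_pad is not None and neg_pad is not None and pos_prefix == "":
            return neg_pad
    return pos_pad


print(effective_pad_position("*0####0"))       # PAD_BEFORE_PREFIX
print(effective_pad_position("*0####0;-*00"))  # PAD_AFTER_PREFIX
print(effective_pad_position("+*0####0"))      # PAD_AFTER_PREFIX
```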