[DFDL-WG] ICU and dfdl:textNumberPattern issue

Steve Hanson smhdfdl at gmail.com
Mon Jan 15 03:10:56 PST 2024


I seem to recall hitting the same problem with X12 fixed length numerics
which could be positive or negative. If I get time before our next call
I'll re-familiarise myself with the X12 issue and what I did to work round
it.

Regards
Steve

On Thu, Jan 11, 2024 at 6:25 PM Mike Beckerle <mbeckerle at apache.org> wrote:

> We've run into an issue with dfdl:textNumberPattern, which is an ICU
> number pattern. I'll describe it here and then suggest that the fix is
> needed in DFDL generally, though we should discuss that hypothesis.
>
> The motivating example is fixed length 5 character integer text data. The
> data ranges from -9999 to 99999. Note that the minus sign uses up one of
> the 5 characters that can be a digit for positive values.
>
> Consider the value -123 and a textNumberPattern of "00000;-0". The value
> unparses as -00123, which is 6 characters and so too long for the field.
>
> The padding feature of ICU number patterns can be used to "fix" this.
> Consider textNumberPattern="*0####0". The "*0" notation makes 0 the pad
> character, which fills the output out to the width of the pattern when
> the number is shorter.
> Now the value 123 unparses as 00123, but here's the problem: -123
> unparses as 0-123. Notice how the zero padding is before the minus sign
> when we wanted it to appear after.
>
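
To make the two pad positions concrete, here is a minimal Python sketch.
It is a toy model of the behavior described above, not the ICU API: the
function format_padded and the string values standing in for
PAD_BEFORE_PREFIX and PAD_AFTER_PREFIX are assumptions made for
illustration, and only a "-" prefix for negative values is modeled.

```python
# Toy model of ICU-style padding for a fixed-width integer field.
# This is NOT the ICU API; names and values here are illustrative only.
PAD_BEFORE_PREFIX = "before_prefix"
PAD_AFTER_PREFIX = "after_prefix"


def format_padded(value: int, width: int, pad_char: str, pad_position: str) -> str:
    """Format value to at least `width` characters, padding relative to the sign."""
    prefix = "-" if value < 0 else ""
    digits = str(abs(value))
    # Padding fills whatever the prefix and digits leave unused.
    padding = pad_char * max(0, width - len(prefix) - len(digits))
    if pad_position == PAD_BEFORE_PREFIX:
        return padding + prefix + digits
    return prefix + padding + digits


print(format_padded(123, 5, "0", PAD_BEFORE_PREFIX))   # 00123
print(format_padded(-123, 5, "0", PAD_BEFORE_PREFIX))  # 0-123
print(format_padded(-123, 5, "0", PAD_AFTER_PREFIX))   # -0123
```

With a space as the pad character the first position gives "  -12" style
output, which is usually fine; with "0" only the second position gives
the wanted result for negative values.
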
> This problem is caused by ICU taking nearly all the information from the
> positive part of the textNumberPattern. The negative part of the pattern,
> if it exists, is used only to define the affixes (prefix, suffix, or both)
> that indicate negative values.
>
> The problem is that positive numbers commonly have no affix, so the
> position of padding characters relative to the affix cannot always be
> determined from the positive pattern alone.
>
> Hence, if textNumberPattern specifies a pad character before the number
> pattern and without a positive prefix, then ICU defaults to a pad position
> of PAD_BEFORE_PREFIX with no way to change it with just the pattern.
>
> This behavior is reasonable for most cases, like when the pad character is
> a space. However, if the pad character in textNumberPattern is '0', then
> negative numbers are padded with a '0' before the negative sign. So we get
> the errant behavior where a pattern of "*0####0" unparses -123 to "0-123".
> This is very unlikely to be what the user wants with this pattern.
>
> Now suppose the positive pattern required a prefix "+" sign. The
> textNumberPattern of "+*0####0" works properly because ICU determines that
> the padding is PAD_AFTER_PREFIX from the positive pattern where the "*0" is
> after the "+" prefix.
>
> The proposed fix to this issue that we're implementing in Daffodil is
> this: If both negative and positive patterns define padding on the same
> affix, and the positive pattern has an empty string for that affix, then we
> use the pad position from the negative pattern. In all other cases, the
> padding in the negative pattern is ignored, following usual ICU behavior.
>
> For example, a textNumberPattern of "*0####0;-*00" formats a negative
> number with zero padding after the hyphen, whereas normal ICU behavior
> would ignore the negative pattern and zero pad before the hyphen.
>
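
The selection rule proposed above can be sketched in Python as follows.
This is a simplified model under stated assumptions, not Daffodil's or
ICU's actual code: it looks only at prefix-side padding, ignores quoted
literals in patterns, and the helper names split_prefix and
choose_pad_position are hypothetical.

```python
# Simplified model of the proposed pad-position rule (prefix side only).
# Not Daffodil or ICU code; helper names are hypothetical.
NUMBER_CHARS = set("#0,.")


def split_prefix(subpattern: str):
    """Return (prefix, pad_position) for the part before the number pattern.

    pad_position is None when the subpattern has no "*<char>" pad spec.
    """
    prefix = ""
    pad_position = None
    i = 0
    while i < len(subpattern) and subpattern[i] not in NUMBER_CHARS:
        if subpattern[i] == "*":
            # The pad spec is "*" plus the pad character that follows it.
            pad_position = "after_prefix" if prefix else "before_prefix"
            i += 2
        else:
            prefix += subpattern[i]
            i += 1
    return prefix, pad_position


def choose_pad_position(text_number_pattern: str):
    """Apply the proposed rule to pick the pad position for the pattern."""
    positive, _, negative = text_number_pattern.partition(";")
    pos_prefix, pos_pad = split_prefix(positive)
    if negative:
        _, neg_pad = split_prefix(negative)
        # Both subpatterns pad on the prefix and the positive prefix is
        # empty, so the positive pattern cannot express the position.
        if pos_pad is not None and neg_pad is not None and pos_prefix == "":
            return neg_pad
    # Otherwise follow the positive pattern, as in usual ICU behavior.
    return pos_pad


print(choose_pad_position("*0####0"))       # before_prefix
print(choose_pad_position("*0####0;-*00"))  # after_prefix
print(choose_pad_position("+*0####0"))      # after_prefix
```
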
>
> I would suggest that this fix is needed in DFDL generally.
>
>
> The workaround for the fixed length number use case (-9999 to 99999) is
> very ugly: either treat the minus sign as an initiator and create
> separate elements for positive and negative values, or give up on
> modeling this field as an integer and treat the whole thing as a string.
>
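
For the second workaround, where the schema carries the field as a
string, the conversion moves into the application. A minimal Python
sketch of what that application code might look like is below. The names
FIELD_WIDTH, parse_field and unparse_field are assumptions for
illustration. Python's zero-fill format happens to be sign-aware, so it
pads after the minus sign.

```python
# Sketch of application-side conversion when the schema models the
# 5-character field as a plain string. Names are illustrative only.
FIELD_WIDTH = 5


def unparse_field(value: int) -> str:
    """Render value as exactly FIELD_WIDTH characters, zeros after any sign."""
    text = format(value, f"0{FIELD_WIDTH}d")
    if len(text) != FIELD_WIDTH:
        raise ValueError(f"{value} does not fit in {FIELD_WIDTH} characters")
    return text


def parse_field(text: str) -> int:
    """Convert a FIELD_WIDTH-character field back to an integer."""
    if len(text) != FIELD_WIDTH:
        raise ValueError(f"expected {FIELD_WIDTH} characters, got {len(text)}")
    return int(text)


print(unparse_field(123))    # 00123
print(unparse_field(-123))   # -0123
print(parse_field("-0123"))  # -123
```

The cost is that the numeric type, and the range check, are no longer
expressed in the schema.
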
>
> Arguably, this might be considered an ICU bug. However, ICU's API can be
> used to specify the PAD_AFTER_PREFIX behavior, so ICU does let you achieve
> the needed behavior; it just cannot be achieved using only the ICU pattern
> string. The ICU maintainers may or may not consider this to be a bug.
>
>
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> Owl Cyber Defense | www.owlcyberdefense.com