[DFDL-WG] ICU and maximum integer digits

Wed Jul 11 09:09:58 EDT 2012

Agreed on the WG call that the silent truncation of integers is not 
desirable, and that the ability for variable length text numbers to work 
with a pattern with a generic integer count is useful. DFDL will go with 
the ICU default behaviour.

Erratum therefore taken to change the spec wording about 'maximum integer 
digits'.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848

From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     Steve Hanson/UK/IBM at IBMGB
Cc:     Richard Schofield/UK/IBM at IBMGB, dfdl-wg at ogf.org
Date:   09/07/2012 19:17
Subject:        Re: [DFDL-WG] ICU and maximum integer digits
Sent by:        dfdl-wg-bounces at ogf.org

One additional detail. What if the element has a fixed length of 2 characters, 
so there simply is no option other than to truncate on the left or right, or 
to raise a processing error, I guess?
In that case, which 2 digits of 1997 you get should perhaps depend on the 
number justification. If left-justified, truncate on the right so you get 19. 
If right-justified, truncate on the left to get 97.
Since numbers are usually right-justified, this is the typical behavior and, 
I think, what the ICU library is assuming, patterns being most common in fixed 
length data.
In variable length data I agree with your analysis, which is to never 
truncate most significant digits.
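
The justification rule suggested above can be sketched roughly as follows. 
This is only an illustration: the function name truncate_to_length and its 
parameters are hypothetical and are not taken from ICU or the DFDL 
specification.

```python
def truncate_to_length(digits, length, justification="right"):
    """Fit a digit string into a fixed-length field (hypothetical helper).

    Right-justified numbers lose their most significant (left) digits;
    left-justified numbers lose their least significant (right) digits.
    """
    if len(digits) <= length:
        return digits  # already fits, nothing to cut
    if justification == "left":
        return digits[:length]
    if justification == "right":
        return digits[len(digits) - length:]
    raise ValueError("justification must be 'left' or 'right'")


print(truncate_to_length("1997", 2, "left"))   # keeps the leading digits
print(truncate_to_length("1997", 2, "right"))  # keeps the trailing digits
```

A processing error instead of either truncation would be the third option 
mentioned above; the sketch does not model that choice.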
On Jul 9, 2012 2:00 PM, "Steve Hanson" <smh at uk.ibm.com> wrote:
The ICU web documentation says the following about formatting (unparsing) 
numbers, which has been copied into the DFDL specification: 
If the number of actual integer digits exceeds the maximum integer digits, 
then only the least significant digits are shown. For example, 1997 is 
formatted as "97" if the maximum integer digits is set to 2. 
If the number of actual integer digits is less than the minimum integer 
digits, then leading zeros are added. For example, 1997 is formatted as 
"01997" if the minimum integer digits is set to 5. 
If the number of actual fraction digits exceeds the maximum fraction 
digits, then rounding is performed to the maximum fraction digits. For 
example, 0.125 is formatted as "0.12" if the maximum fraction digits is 2. 
This behavior can be changed by specifying a rounding increment and/or a 
rounding mode. 
If the number of actual fraction digits is less than the minimum fraction 
digits, then trailing zeros are added. For example, 0.125 is formatted as 
"0.1250" if the minimum fraction digits is set to 4.
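
As a rough illustration of the four rules quoted above, the following Python 
sketch models them with the standard decimal module. It is not ICU code: the 
function name format_digits and its parameters are made up for this example, 
half-even rounding is assumed as the default rounding mode, and the default 
values for the fraction digit limits are arbitrary.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def format_digits(value, min_int=1, max_int=309, min_frac=0, max_frac=3):
    """Illustrative model of the four digit-count rules; not the ICU code."""
    d = Decimal(value)  # pass a string to avoid binary float surprises
    # Rule 3: round to at most max_frac fraction digits (half-even assumed).
    d = d.quantize(Decimal(1).scaleb(-max_frac), rounding=ROUND_HALF_EVEN)
    sign = "-" if d < 0 else ""
    int_part, _, frac_part = format(abs(d), "f").partition(".")

    # Rule 1: keep only the least significant max_int integer digits.
    int_part = int_part.lstrip("0")
    if len(int_part) > max_int:
        int_part = int_part[len(int_part) - max_int:]
    # Rule 2: pad with leading zeros up to min_int digits.
    int_part = int_part.zfill(min_int)

    # Rule 4: drop optional trailing zeros, then pad up to min_frac digits.
    frac_part = frac_part.rstrip("0").ljust(min_frac, "0")

    return sign + int_part + ("." + frac_part if frac_part else "")


print(format_digits("1997", max_int=2))
print(format_digits("1997", min_int=5))
print(format_digits("0.125", max_frac=2))
print(format_digits("0.125", min_frac=4, max_frac=4))
```

The first call shows the silent loss of the most significant digits that the 
rest of this note is concerned with.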

The latest draft of the spec has incorporated errata 2.29 which now 
defines the terms maximum integer digits, etc, and does so in terms of the 
pattern. 
·        The term maximum fraction digits is the total number of ‘0’ and 
‘#’ characters in the fraction sub-pattern above. 
·        The term minimum fraction digits is the total number of ‘0’ 
characters (only) in the fraction sub-pattern above. 
·        The term maximum integer digits is the total number of ‘0’ and 
‘#’ characters in the integer sub-pattern above. 
·        The term minimum integer digits is the total number of ‘0’ 
characters (only) in the integer sub-pattern above. 
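
The four definitions above amount to counting characters in each 
sub-pattern. A minimal sketch, assuming a simple positive pattern with no 
prefix or suffix literals, no quoted characters and no exponent (the name 
digit_limits is hypothetical):

```python
def digit_limits(pattern):
    """Count digit placeholders in a simple pattern such as '#,##0.00#'.

    Assumes a positive sub-pattern only, with no literals or exponent.
    Grouping separators are ignored because only '0' and '#' are counted.
    """
    integer_sub, _, fraction_sub = pattern.partition(".")
    return {
        "max_int": integer_sub.count("0") + integer_sub.count("#"),
        "min_int": integer_sub.count("0"),
        "max_frac": fraction_sub.count("0") + fraction_sub.count("#"),
        "min_frac": fraction_sub.count("0"),
    }


print(digit_limits("#0"))
print(digit_limits("#,##0.00#"))
```

Under these definitions the pattern "#0" has a maximum integer digits of 2, 
which is what makes the truncation question below matter.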
That all appears to make sense, but on close reading the ICU behaviour for 
maximum integer digits appears to be undesirable, in that it will silently 
truncate oversize integer portions. From above: 'For example, 1997 is 
formatted as "97" if the maximum integer digits is set to 2.' 

Interestingly, while ICU derives minimum integer digits, minimum fraction 
digits and maximum fraction digits from the pattern, ICU does not derive 
maximum integer digits from the pattern and instead uses a default of 309. 
There is an explicit ICU API call that you have to make to set it.   

Because of this inconsistent ICU behaviour, the IBM DFDL implementation 
does not call this ICU API today, and so allows up to 309 digits to 
be formatted regardless of pattern. E.g., "#0" works for infoset values "1", 
"12", "123456789" without any loss. As well as avoiding the silent 
truncation, this is convenient for variable length text numbers, as a 
single textNumberPattern value such as "#0" can be set in scope and widely 
used. It does mean, however, that variable length text numbers do not have 
their integer digit length policed (fixed length text numbers are policed by 
the length of the element).

I think it is worth ratifying that the spec words above are the true 
intended behaviour, and if so noting that an implementation should a) not 
call the ICU API, b) let the maximum integer digits default to 309, and c) 
implement maximum integer digits processing itself. 

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

--
  dfdl-wg mailing list
  dfdl-wg at ogf.org
  https://www.ogf.org/mailman/listinfo/dfdl-wg

