[DFDL-WG] clarification needed? dfdl:textNumberCheckPolicy 'strict' - language suggests more strict than ICU libraries

Steve Hanson smh at uk.ibm.com
Wed Sep 2 08:58:45 EDT 2020


I agree with your interpretation of grouping separator behaviour.

Yes, it's possible that the 'strict' wording is out of date.  It feels like 
we tried to summarise strict behaviour to avoid listing the specifics, but 
didn't get it quite right.

I will add this to the agenda for tomorrow's call.

Regards
 
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday 



From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     DFDL-WG <dfdl-wg at ogf.org>
Date:   01/09/2020 14:54
Subject:        [EXTERNAL] [DFDL-WG] clarification needed? 
dfdl:textNumberCheckPolicy 'strict' - language suggests more strict than 
ICU libraries
Sent by:        "dfdl-wg" <dfdl-wg-bounces at ogf.org>




Say we have this schema snippet:
<xs:element name="SimpleDataFormat">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="NumStudents" type="xs:nonNegativeInteger" 
                dfdl:textNumberCheckPolicy="strict"
                dfdl:textNumberPattern="#,###"
                dfdl:textStandardGroupingSeparator=","
                dfdl:textStandardDecimalSeparator="."
            />
        </xs:sequence>
    </xs:complexType>
</xs:element>

This successfully parses the data
1234

Even though textNumberCheckPolicy="strict" and the pattern contains a 
grouping separator, it still allows data that does not contain grouping 
separators. 
That said, we have generally tried to make DFDL's spec match the behavior 
of the ICU library for parsing numbers based on the textNumberPattern. 
This library has this to say about strict parsing of numbers:
    The following conditions cause a parse failure relative to [lax] mode
    (examples use the pattern "#,##0.#"):
    - The presence and position of special symbols, including currency,
      must match the pattern.
        '+123' fails (there is no plus sign in the pattern)
    - Leading or doubled grouping separators
        ',123' and '1,,234' fail
    - Groups of incorrect length when grouping is used
        '1,23' and '1234,567' fail, but '1234' passes
    - Grouping separators used in numbers followed by exponents
        '1,234E5' fails, but '1234E5' and '1,234E' pass ('E' is not an 
        exponent when not followed by a number)
So based on ICU's description of strict, this is the expected behavior. It 
doesn't say anything about missing grouping separators causing an error. 
Only that if they do exist then they must be in the right spot.
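
To make the grouping rules concrete, here is a minimal Python sketch of how 
I read the ICU text quoted above. It is not ICU code and does not call ICU; 
the function name strict_grouping_ok and its parameters are hypothetical, and 
it only looks at the integer digits (no sign, decimal part, or exponent), 
assuming a group size of 3 as in the pattern "#,###".

```python
def strict_grouping_ok(text: str, sep: str = ",", group_size: int = 3) -> bool:
    """Hypothetical model of the quoted strict grouping rules; not ICU itself."""
    groups = text.split(sep)
    # Every group must be non-empty ASCII digits. An empty group means a
    # leading or doubled separator, which the quoted rules reject.
    if not all(g.isascii() and g.isdigit() for g in groups):
        return False
    if len(groups) == 1:
        # No separator present at all: grouping is optional, so this passes.
        return True
    # Grouping is used: the first group may be short but not too long,
    # and every later group must be exactly group_size digits.
    if len(groups[0]) > group_size:
        return False
    return all(len(g) == group_size for g in groups[1:])


if __name__ == "__main__":
    for sample in ["1234", "1,234", ",123", "1,,234", "1,23", "1234,567"]:
        print(repr(sample), strict_grouping_ok(sample))
```

Under this reading, '1234' is accepted because no grouping is used at all, 
while the failures are all cases where a separator is present but misplaced.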
The only thing the DFDL specification mentions regarding strict numbers is 
this:
If 'strict' and dfdl:textNumberRep is 'standard' then the data must 
follow the pattern with the exceptions that digits 0-9, decimal 
separator and exponent separator are always recognised and parsed
To me, that reads like the grouping separator should always be required in 
strict mode, so it seems that the ICU behavior and the behavior described 
in the DFDL specification do not match. I believe the DFDL behavior was 
intended to match ICU behavior, so the DFDL specification may need to be 
updated.

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Owl Cyber Defense | 
www.owlcyberdefense.com
Please note: Contributions to the DFDL Workgroup's email discussions are 
subject to the OGF Intellectual Property Policy



Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU