[DFDL-WG] clarification needed? dfdl:textNumberCheckPolicy 'strict' - language suggests more strict than ICU libraries

Mike Beckerle mbeckerle.dfdl at gmail.com
Tue Sep 1 09:54:16 EDT 2020


Say we have this schema snippet:

<xs:element name="SimpleDataFormat">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="NumStudents" type="xs:nonNegativeInteger"
                dfdl:textNumberCheckPolicy="strict"
                dfdl:textNumberPattern="#,###"
                dfdl:textStandardGroupingSeparator=","
                dfdl:textStandardDecimalSeparator="."
            />
        </xs:sequence>
    </xs:complexType>
</xs:element>

This successfully parses the data

1234

Even though textNumberCheckPolicy="strict" and the pattern contains a
grouping separator, it still allows data that does not contain grouping
separators.

That said, we have generally tried to make DFDL's spec match the behavior
of the ICU library for parsing numbers based on the textNumberPattern. This
library has this to say about strict parsing of numbers:

The following conditions cause a parse failure relative to [lax] mode
(examples use the pattern "#,##0.#"):

   - The presence and position of special symbols, including currency,
   must match the pattern.

'+123' fails (there is no plus sign in the pattern)

   - Leading or doubled grouping separators

',123' and '1,,234' fail

   - Groups of incorrect length when grouping is used

'1,23' and '1234,567' fail, but '1234' passes

   - Grouping separators used in numbers followed by exponents

'1,234E5' fails, but '1234E5' and '1,234E' pass ('E' is not an
exponent when not followed by a number)

So based on ICU's description of strict, this is the expected behavior. It
doesn't say anything about missing grouping separators causing an error,
only that if they are present they must be in the right position.
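
To make the grouping rules quoted above concrete, here is a rough sketch in
Python. It is not ICU's implementation; it only encodes the grouping
conditions listed above for the integer part of a number, assuming a group
size of 3 and "," as the separator. The function name and parameters are
made up for illustration, and signs, decimals and exponents are ignored.

```python
def _ascii_digits(s):
    """True if s is non-empty and consists only of the digits 0-9."""
    return s != "" and all(c in "0123456789" for c in s)


def strict_grouping_ok(text, sep=",", group_size=3):
    """Hypothetical check mirroring the grouping rules quoted from ICU.

    Assumption: only the integer part is considered.
    """
    if sep not in text:
        # Grouping is not used at all, so there is nothing to violate.
        return _ascii_digits(text)
    first, *rest = text.split(sep)
    # A leading separator leaves the first group empty; an over-long
    # first group means the separators are in the wrong position.
    if not _ascii_digits(first) or len(first) > group_size:
        return False
    # Every later group must be exactly group_size digits; a doubled
    # separator shows up here as an empty group.
    return all(_ascii_digits(g) and len(g) == group_size for g in rest)
```

Under these assumptions '1234' and '1,234' are accepted, while ',123',
'1,,234', '1,23' and '1234,567' are rejected, which lines up with the
examples in the ICU text.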

The only thing the DFDL specification mentions regarding strict numbers is
this:

If 'strict' and dfdl:textNumberRep is 'standard' then the data must
follow the pattern with the exceptions that digits 0-9, decimal
separator and exponent separator are always recognised and parsed

To me, that reads like the grouping separator should always be required in
strict mode when the pattern contains one, since it is not among the listed
exceptions. So the ICU behavior and the behavior described in the DFDL
specification do not seem to match. I believe the DFDL behavior was
intended to match ICU behavior, so it's possible the DFDL specification
needs to be updated.
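
For contrast, the stricter reading of the specification text could be
sketched like this in Python. This is only one possible interpretation,
not something the spec defines: it assumes the pattern "#,###" with "," as
the grouping separator, and the function name is hypothetical.

```python
import re

# Assumption: "must follow the pattern" is read as requiring a separator
# between every group of three digits, counting from the right.
_GROUPED = re.compile(r"[0-9]{1,3}(?:,[0-9]{3})*")


def grouping_required_ok(text):
    """Hypothetical check: grouping separators are mandatory where the
    pattern would place them."""
    return _GROUPED.fullmatch(text) is not None
```

With this reading '1,234' is accepted but '1234' is rejected, which is the
opposite of what ICU's description of strict parsing does for '1234'.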

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Owl Cyber Defense |
www.owlcyberdefense.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>