[DFDL-WG] lax calendar parsing

Mike Beckerle mbeckerle.dfdl at gmail.com
Mon May 18 12:24:32 EDT 2020


We discussed lax number processing a while back. We have the same issue
with lax calendar parsing.

The DFDL spec has this language:


   Additional lenient parsing behaviour when in 'lax' mode:

   1. Values outside valid ranges are normalized (e.g., "March 32 1996" is
      treated as "April 1 1996")
   2. Ignoring a trailing dot after a non-numeric field
   3. Leading and trailing whitespace in the data but not in the pattern is
      accepted
   4. Whitespace in the pattern can be missing in the data
   5. Partial matching on literal strings. E.g., data "20130621d" allowed
      for pattern "yyyyMMdd'date' "

I suggest that the first line of that passage add the word "may", as in:
"Additional lenient parsing behaviour when in 'lax' mode MAY include:"

This is because we've discovered that lax behavior in the ICU libraries we
rely on varies from one ICU release to the next. So I think we have to make
the spec consistent with the idea that "lax" parsing for numbers and
calendars is implementation-dependent, and that only "strict" behavior can
be relied upon to be durably meaningful, even across releases of the same
DFDL implementation.
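
To make the distinction concrete, here is a minimal sketch of item 1 from
the list above (out-of-range values being normalized). It uses the JDK's
java.text.SimpleDateFormat purely as a stand-in, because it has a similar
lenient/strict switch. It is not ICU and not code from any DFDL
implementation, and the class and method names (LaxCalendarSketch,
parseToIso) are hypothetical. The other lax behaviors in the list are not
modeled here, and ICU's lenient behavior may well differ from the JDK's.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class LaxCalendarSketch {
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Hypothetical helper: parse 'text' using 'pattern' in lenient or strict
    // mode. Returns the date as yyyy-MM-dd, or null if the parser rejects it.
    static String parseToIso(String pattern, String text, boolean lenient) {
        SimpleDateFormat in = new SimpleDateFormat(pattern, Locale.US);
        in.setTimeZone(UTC);
        in.setLenient(lenient); // JDK analogue of a lax/strict policy switch
        try {
            Date parsed = in.parse(text);
            SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd", Locale.US);
            out.setTimeZone(UTC);
            return out.format(parsed);
        } catch (ParseException e) {
            return null; // strict mode refuses out-of-range field values
        }
    }

    public static void main(String[] args) {
        String pattern = "MMMM d yyyy";
        String data = "March 32 1996";
        // Lenient: day 32 of March rolls over into April.
        System.out.println("lax:    " + parseToIso(pattern, data, true));
        // Strict: the same data is rejected (null).
        System.out.println("strict: " + parseToIso(pattern, data, false));
    }
}
```

The point of the sketch is only that a schema author who depends on the
first result is depending on library behavior, whereas the second result
is the one that can be expected to stay stable.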

This doesn't make "lax" behavior entirely useless. Suppose you are doing a
one-time conversion of some data from a native format to JSON or XML, or
loading it into your favorite data-integration tool. If you can get that to
work once using "lax", that's ok, because you intend to discard the schema
as soon as the conversion is complete.

So it doesn't bother me to have lax behavior. I think we just want to say
that you can't rely on it to be consistent, and you can't rely on it to
actually be any different from 'strict' behavior.

I think the alternatives are:
1) fork the ICU libraries, carefully characterize lax behavior in that
fork, and maintain it ourselves forever after. (I really don't like this
option. I'm just mentioning it to point out the difficulty.)
2) deprecate and remove 'lax' behavior entirely and the properties
associated with specifying it.
3) make 'lax' an optional DFDL feature, so implementations can choose to
not bother implementing it.




Mike Beckerle | OGF DFDL Workgroup Co-Chair | Owl Cyber Defense |
www.owlcyberdefense.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>