[DFDL-WG] Action 306 - IBM DFDL behaviour when parsing empty strings
Steve Hanson
smh at uk.ibm.com
Tue Jun 25 13:33:55 EDT 2019
Thinking this through, I don't think that IBM DFDL supporting
dfdl:nilValue="%ES;" is a problem. It's the first check that is made when
we get a rep with length 0.
However, I think the enum "treatAsMissing" would be better named
"treatAsAbsent", since absent is one of the specific cases of "missing" and
is what we technically fall through to.
One subtlety not yet discussed is this:
9.3.2.1 Simple element
If the result is length zero as described above, the representation is
then established by checking, in order for:
1. nil representation (if %ES; is a literal nil value).
2. empty representation.
3. normal representation (xs:string or xs:hexBinary only)
4. absent representation (if none of the prior representations
apply).
This is intended to handle the case when the length of the rep is 0 but we
are not conforming with EVDP (dfdl:emptyValueDelimiterPolicy). For example,
an element has initiator "<a>" and terminator "</a>", the EVDP is terminator
only or initiator only, and the data contains "<a></a>". We are not
conforming to EVDP, so this can't be the empty rep, but we are conforming to
the normal rep, so it is not the absent rep either. The implication is that
an empty string would be added to the infoset. I'm pretty sure that IBM DFDL
will not add anything to the infoset for this case, but I will test it to be
sure.
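The checking order quoted above can be sketched in Python. This is only an illustration, not code from IBM DFDL or Daffodil: the function name, the enum, and the three boolean inputs are hypothetical, and the sketch assumes the caller has already established that the content is zero length and has matched whatever delimiters are present.

```python
from enum import Enum


class Representation(Enum):
    NIL = "nil"
    EMPTY = "empty"
    NORMAL = "normal"
    ABSENT = "absent"


def establish_zero_length_representation(
    es_is_nil_literal: bool,
    conforms_to_evdp: bool,
    is_string_or_hexbinary: bool,
) -> Representation:
    """Apply the four checks in order for a zero-length result (hypothetical helper)."""
    # 1. nil representation, only if %ES; is one of the literal nil values
    if es_is_nil_literal:
        return Representation.NIL
    # 2. empty representation, only if the delimiters found conform to EVDP
    if conforms_to_evdp:
        return Representation.EMPTY
    # 3. normal representation, xs:string or xs:hexBinary only
    if is_string_or_hexbinary:
        return Representation.NORMAL
    # 4. absent representation, when nothing above applies
    return Representation.ABSENT


# The "<a></a>" case: a string element, %ES; not a nil value, EVDP not satisfied.
case = establish_zero_length_representation(
    es_is_nil_literal=False,
    conforms_to_evdp=False,
    is_string_or_hexbinary=True,
)
```

Under these assumptions the "<a></a>" case falls through to the normal representation, which is why an empty string would end up in the infoset.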
The proposed name of the property does not capture this subtlety though:
dfdl:emptyElementParsePolicy = ( "treatAsAbsent" | "treatAsEmpty" )
Using 'zeroLength' instead of 'empty' goes too far the other way, as it
encompasses the dfdl:nilValue="%ES;" case.
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl at gmail.com>
Cc: DFDL-WG <dfdl-wg at ogf.org>
Date: 31/05/2019 14:42
Subject: Re: [DFDL-WG] Action 306 - IBM DFDL behaviour when parsing
empty strings
Re-reading this thread from the bottom, I'm not sure the proposal is
correct. I may have over-simplified it. I'm going to have to do some more
tests. Specifically around dfdl:nilValue="%ES;". IBM DFDL supports this,
which means we can't just be treating empty elements as missing all the
time.
Regards
Steve Hanson
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl at gmail.com>
Cc: DFDL-WG <dfdl-wg at ogf.org>
Date: 09/05/2019 08:40
Subject: Re: [DFDL-WG] Action 306 - IBM DFDL behaviour when parsing
empty strings
It will hopefully be possible for you to apply the new 'treatAsMissing'
enum to just 2 places in the Daffodil code:
1) Empty rep found for required occurrence -> processing error
2) Empty rep found for optional occurrence -> don't add anything to
infoset
Maybe 4 places if you have separate paths for simple v complex.
That's an indicator of how conceptually simple this property is, once you
know the difference between empty & missing.
I am pretty sure the IBM DFDL behaviour deviation around empty/missing can
be encapsulated by just this.
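As a rough illustration of how small the 'treatAsMissing' change is, here is a Python sketch of the two cases listed above. It is not Daffodil or IBM DFDL code: `ProcessingError`, `on_empty_rep_treat_as_missing`, and the list-based infoset are all hypothetical stand-ins.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


def on_empty_rep_treat_as_missing(name: str, required: bool, infoset: list) -> None:
    """Handle an occurrence whose representation was found to be empty."""
    if required:
        # 1) Empty rep found for required occurrence -> processing error.
        #    No default is applied and no empty value is created.
        raise ProcessingError(f"required element '{name}' is missing")
    # 2) Empty rep found for optional occurrence -> add nothing to the infoset.
    return None


infoset: list = []
on_empty_rep_treat_as_missing("opt", required=False, infoset=infoset)  # no change

try:
    on_empty_rep_treat_as_missing("req", required=True, infoset=infoset)
    raised = False
except ProcessingError:
    raised = True  # a real parser could backtrack at this point
```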
Regards
Steve Hanson
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl at gmail.com>
Cc: DFDL-WG <dfdl-wg at ogf.org>
Date: 09/05/2019 08:12
Subject: Re: [DFDL-WG] Action 306 - IBM DFDL behaviour when parsing
empty strings
I prefer
dfdl:emptyElementParsePolicy = ( "treatAsMissing" | "treatAsEmpty" )
You have to understand the difference between empty and missing in DFDL.
It has an effect on all types - for example, if you set "treatAsMissing"
for a required number, it means empty always causes a processing error
instead of potentially applying a default.
Regards
Steve Hanson
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: Steve Hanson <smh at uk.ibm.com>
Cc: DFDL-WG <dfdl-wg at ogf.org>
Date: 08/05/2019 19:26
Subject: Re: [DFDL-WG] Action 306 - IBM DFDL behaviour when parsing
empty strings
I suggest we stick with the "...Policy" naming convention for new things
that control modes of behavior.
I'd prefer to avoid the terms empty and missing in the property values and
go with something that is more explanatory of what difference it makes.
E.g., emptyElementParsePolicy with values
"excludeEmptyStringAndHexBinaryValues" and
"allowEmptyStringAndHexBinaryValues"
The doc for these values will of course have to be in terms of
Absent/Missing/Empty, etc., but at least the names give some intuition as
to what they control without having to understand all of DFDL's nuances
about the difference between Absent and Missing.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
On Wed, May 8, 2019 at 12:52 PM Steve Hanson <smh at uk.ibm.com> wrote:
Maybe this is better:
dfdl:parseEmptyAsMissing = yes | no
Regards
Steve Hanson
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: Steve Hanson <smh at uk.ibm.com>
Cc: DFDL-WG <dfdl-wg at ogf.org>
Date: 08/05/2019 16:48
Subject: Re: [DFDL-WG] Action 306 - IBM DFDL behaviour when parsing
empty strings
Interesting. Many DFDL schemas I've created have a simpleType definition
named "nzString" which is string, plus an assertion that it is non-empty.
That's to achieve exactly the behavior you have in IBM DFDL, because, as
you say, many formats want this.
We could rename the suggested property emptyElementParsePolicy to make it
clear it is only about parsing.
I like treatAsMissing. Easy to say what it means.
treatAsEmpty begs the question of what empty elements do, but that's
already complicated in the spec due to optionals and EVDP, so I'm happy
with this also.
...mikeb
On Tue, May 7, 2019 at 3:48 AM Steve Hanson <smh at uk.ibm.com> wrote:
Hi Mike
I think what you have highlighted is that there are formats which require
that empty elements should not be treated as empty but as missing, which
is effectively what IBM DFDL is doing (our code was written prior to
action 140 when there was no distinction between empty & missing). That
could be achieved with assertions. So maybe we should view the new
property as a convenience property for such formats, as well as handling
IBM DFDL's behaviour?
If so, then can I suggest new names for the enums, which I think make the
intent clearer?
dfdl:emptyElementPolicy = ( "treatAsMissing" | "treatAsEmpty" )
This only applies when parsing, so maybe the names should reflect that also?
Further, "treatAsMissing" would imply that a default value is never used
when parsing, as defaults are only used when the representation is empty. I
think we can do away with the SDE clause for "treatAsMissing". The clause
is only needed for "treatAsEmpty".
IBM DFDL does implement nillable processing, including use of ES as nil
literal value.
Regards
Steve Hanson
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: Steve Hanson <smh at uk.ibm.com>
Cc: DFDL-WG <dfdl-wg at ogf.org>
Date: 03/05/2019 21:23
Subject: Re: [DFDL-WG] Action 306 - IBM DFDL behaviour when parsing
empty strings
Under testing with the EDIFACT schema (from DFDLSchemas on github) against
new code in daffodil, I see that my proposal was not sufficient.
Steve Hanson stated that IBM DFDL current behavior for required empty
strings includes "An empty occurrence with no default gives a Processing
Error."
I misinterpreted this. I was thinking of a required occurrence of an array
element (as in, one with index <= minOccurs). But it should not be
interpreted that narrowly: it means any required occurrence at all,
including scalar elements. The EDIFACT schema depends on this behavior, and
on the backtracking driven by it, in order to work.
So my suggestion for new properties to control this is revised to:
dfdl:emptyElementPolicy enum with values
noEmptyElements - matches current IBM DFDL behavior where
* required elements without default values that are empty (specifically,
which satisfy the empty syntax, defined below) always cause Processing
Errors.
** If a default value is specified, that is provided as the value instead.
When a default value is specified, implementations that don't support
default values when parsing must issue a runtime SDE here, not a
processing error.
* optional elements which satisfy the empty syntax are not added to the
infoset. Defaulting is never considered.
emptyElements - matches current description in the DFDL spec where
* required elements: if the string/hexBinary satisfies the empty syntax
then required elements are created with an empty string or empty hexBinary
as their value. If a default value is specified, that is substituted as the
value instead. When a default value is specified, implementations
that don't support default values when parsing must issue an SDE here, not
a processing error.
* optional elements: if the string/hexBinary satisfies the empty syntax
and emptyValueDelimiterPolicy is not 'none', then an empty string (or
empty hexBinary) is added to the infoset. If emptyValueDelimiterPolicy is
'none', nothing is added to the infoset.
The term "satisfy the empty syntax" means what is found in the data stream
may require initiator and/or terminator depending on
emptyValueDelimiterPolicy, but if that is 'none' then this is satisfied
just by empty string (or no bytes for hexBinary).
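The "satisfy the empty syntax" test can be sketched in Python as follows. The function and parameter names are hypothetical. The sketch assumes the delimiters found must match emptyValueDelimiterPolicy exactly, which is how the "<a></a>" example elsewhere in this thread is read (both delimiters present does not conform to an initiator-only or terminator-only policy).

```python
# Delimiters each emptyValueDelimiterPolicy value expects to find:
# (initiator expected, terminator expected)
EXPECTED_DELIMITERS = {
    "none": (False, False),
    "initiator": (True, False),
    "terminator": (False, True),
    "both": (True, True),
}


def satisfies_empty_syntax(
    content: bytes,
    initiator_found: bool,
    terminator_found: bool,
    evdp: str,
) -> bool:
    """Return True if zero-length content plus the delimiters found match EVDP."""
    if evdp not in EXPECTED_DELIMITERS:
        raise ValueError(f"unknown emptyValueDelimiterPolicy: {evdp!r}")
    if len(content) != 0:
        return False
    # Assumption: an exact match is required, not just "at least these delimiters".
    return (initiator_found, terminator_found) == EXPECTED_DELIMITERS[evdp]
```

With 'none', the check reduces to zero-length content with no delimiters, matching the description above.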
Having said the above, I believe we also have to consider nillable
elements.
There are two topics:
1) defaulting to nilled - For a nillable element whose data syntax does NOT
match the nil representation: wherever the above describes behavior tied to
a default value being specified, an element that is nillable and has
dfdl:useNilAsDefault='true' is defaulted to nilled instead. When nillable
and dfdl:useNilAsDefault='true' are specified, implementations that don't
support defaulting to nilled when parsing must issue an SDE here, not a
processing error.
That takes care of the defaulting aspect of nillables.
The second topic is:
2) nillable, and dfdl:nilValue contains %ES; as one of the possible nil
representations. Hence, there is the possibility of empty string (or empty
hexBinary) matching the nil representation.
I think the DFDL spec is clear here that if the data stream satisfies the
nil syntax, then required or optional, you get a nilled element, period.
Does IBM DFDL implement that behavior? If so great. If not I think we may
have to amend the above description of noEmptyElements case for
dfdl:emptyElementPolicy to specify the special cases.
...mikeb
On Sun, Apr 28, 2019 at 9:36 AM Mike Beckerle <mbeckerle.dfdl at gmail.com>
wrote:
One clarification: is the IBM DFDL behavior the same for empty hexBinary
elements as it is for text strings?
I'm going to suggest we need a policy property e.g.,
dfdl:emptyElementPolicy which is an enum with at least these options:
noOptionalEmptyElements - matches current IBM DFDL behavior
optionalEmptyElementsWithSyntax - matches current description in the DFDL
spec where initiator and/or terminator found triggers creation of an empty
string value. (Daffodil implements this.)
This would apply (I think) to both types xs:string and xs:hexBinary
I'm open to suggestions for better naming for the property and the
property values, but these are the two settings we need I think.
I do believe that the latter optionalEmptyElementsWithSyntax behavior is
what the DFDL spec describes, and is most consistent given the available
properties such as emptyValueDelimiterPolicy.
We can make implementation of optionalEmptyElementsWithSyntax a DFDL
optional language feature, thereby avoiding issues of conformance with the
DFDL standard.
On Fri, Apr 5, 2019 at 12:43 PM Steve Hanson <smh at uk.ibm.com> wrote:
Daffodil to perform identical tests but the belief is that they implement
the spec as published (except maybe for one bug with default values for
strings).
So there is a mis-match between Daffodil and IBM DFDL. It sounds like a
new property is going to be needed which toggles the way that empty
strings are handled.
Regards
Steve Hanson
From: Steve Hanson/UK/IBM
To: DFDL-WG <dfdl-wg at ogf.org>
Cc: "Mike Beckerle" <mbeckerle at tresys.com>, "Michele Zundo" <
michele.zundo at esa.int>, Bradd Kadlecik/Poughkeepsie/IBM at IBMUS
Date: 03/04/2019 12:04
Subject: Action 306 - IBM DFDL behaviour when parsing empty strings
306
Confirm IBM DFDL behaviour when parsing empty strings (Steve)
7/8: IBM DFDL has not fully implemented the behaviour changes arising from
action 140 with respect to empty string elements. Daffodil is about to do
so. IBM DFDL users have complained about lack of defaults when parsing but
other than that appear happy. Are the rules in the spec for empty strings
over complicated? Steve to document the behaviour for IBM DFDL to inform
the discussion.
...
1/11: In progress - there are a lot of subtle scenarios
15/11: Not discussed
...
7/2/19: No further progress
Some progress :)
9.4.2.2 Simple element (xs:string or xs:hexBinary)
Required occurrence: If the element has a default value then an item is
added to the infoset using the default value, otherwise an item is added
to the Infoset using empty string (type xs:string) or empty hexBinary
(type xs:hexBinary) as the value.
Optional occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none' then
an item is added to the Infoset using empty string (type xs:string) or
empty hexBinary (type xs:hexBinary) as the value, otherwise nothing is
added to the Infoset.
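The spec rules for an empty xs:string occurrence, as quoted above, can be sketched in Python. The function name and signature are hypothetical, and `None` is used here to mean "nothing is added to the Infoset".

```python
from typing import Optional


def spec_value_for_empty_string(
    required: bool,
    default: Optional[str],
    evdp: str,
) -> Optional[str]:
    """Value added to the Infoset for an empty xs:string occurrence, or None."""
    if required:
        # Required occurrence: use the default if there is one, else empty string.
        return default if default is not None else ""
    # Optional occurrence: an empty string is added only when
    # dfdl:emptyValueDelimiterPolicy is not 'none'.
    return "" if evdp != "none" else None
```

For xs:hexBinary the same shape would apply with an empty byte string in place of "".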
IBM DFDL behaviour:
Required. IBM DFDL does not implement default values when parsing, so an
empty occurrence with a default value gives an SDE (to prevent
backtracking). An empty occurrence with no default gives a Processing
Error. If you need to add an empty string to the infoset, you can add
default="" (when default values are implemented, of course).
Optional. IBM DFDL adds nothing to the infoset regardless of presence of
initiator and/or terminator. No way to get empty string into the infoset.
9.4.2.3 Complex element
Required occurrence: An item is added to the Infoset.
Optional occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none' then
an item is added to the Infoset, otherwise nothing is added to the
Infoset.
For both required and optional occurrences, the Infoset item may also have
a child item.
1. If the first child element of the complex type is a required
simple element, then an empty string (type xs:string), empty hexBinary
(type xs:hexBinary), or default value will also be added to the Infoset.
2. If the first child element of the complex type is a required
complex element, then an item is added to the Infoset (which may itself
have a child via (1))
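The child-item rules (1) and (2) are recursive, which a short Python sketch may make easier to see. Everything here is a hypothetical model: elements are plain dicts with "name", "kind", "required", and optionally "default" and "children", and simple children are assumed to be xs:string. The sketch also assumes the item for the complex element itself has already been judged addable per the required/optional rules above.

```python
def empty_complex_item(element: dict) -> dict:
    """Build the Infoset item for an empty complex element (illustrative only)."""
    item = {"name": element["name"], "children": []}
    children = element.get("children", [])
    if not children:
        return item
    first = children[0]
    if not first["required"]:
        return item  # only a required first child contributes anything
    if first["kind"] == "simple":
        # (1) required simple first child: default value, else empty string
        item["children"].append(
            {"name": first["name"], "value": first.get("default", "")}
        )
    else:
        # (2) required complex first child: add an item, which may itself
        # gain a child by the same rules
        item["children"].append(empty_complex_item(first))
    return item


record = {
    "name": "rec", "kind": "complex", "required": True,
    "children": [
        {"name": "hdr", "kind": "complex", "required": True,
         "children": [{"name": "id", "kind": "simple", "required": True}]},
        {"name": "body", "kind": "simple", "required": True},
    ],
}
result = empty_complex_item(record)
```

Only the first child at each level is considered, so "body" in this example contributes nothing.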
IBM DFDL behaviour:
Required. IBM DFDL follows the spec (except for case 1 in situations where
its 9.4.2.2 behaviour would throw an error).
Optional. IBM DFDL follows the spec (except for case 1 in situations where
its 9.4.2.2 behaviour would throw an error).
So ...
The spec today is consistent in one way, in that for both complex & string
elements a) a required empty occurrence always adds to the infoset; & b)
an optional empty occurrence adds to the infoset if initiator/terminator
present; & c) an optional empty occurrence does not add to the infoset if
no initiator/terminator present.
If the simple string behaviour were to change to match IBM DFDL then that
consistency is lost, but the string behaviour then matches that for other
simple types. Section 9.4.2.2 disappears as the behaviour is the same as
9.4.2.1. Section 9.4.2.3 becomes as below. We lose the ability to get an
empty string into the infoset for an optional string with
initiator/terminator.
9.4.2.3 Complex element
Required occurrence: An item is added to the Infoset.
Optional occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none' then
an item is added to the Infoset, otherwise nothing is added to the
Infoset.
For both required and optional occurrences, the Infoset item may also have
a child item.
1. If the first child element of the complex type is a required
simple element, then a default value will also be added to the Infoset.
2. If the first child element of the complex type is a required
complex element, then an item is added to the Infoset (which may itself
have a child via (1))
We also need to be sure that any other implementations have not yet
implemented the current spec behaviour. Need to check with DFDL4S and IBM
TPF.
To be discussed on next WG call ...
Regards
Steve Hanson
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU