[DFDL-WG] Action 306 - IBM DFDL behaviour when parsing empty strings

Steve Hanson smh at uk.ibm.com
Tue Jun 25 13:33:55 EDT 2019


Thinking this through, I don't think that IBM DFDL supporting 
dfdl:nilValue="%ES;" is a problem. It's the first check that is made when 
we get a rep with length 0.

However I think that the enum "treatAsMissing" would be better as 
"treatAsAbsent" which is one of the specific cases of "missing" and what 
we technically fall through to.

One subtlety not discussed is this:

9.3.2.1 Simple element
If the result is length zero as described above, the representation is 
then established by checking, in order, for:
 1. nil representation (if %ES; is a literal nil value).
 2. empty representation.
 3. normal representation (xs:string or xs:hexBinary only).
 4. absent representation (if none of the prior representations apply).

This is intended to handle the case where the length of the representation 
is 0 but the data does not conform to emptyValueDelimiterPolicy (EVDP). For 
example, an element has initiator <a> and terminator </a>, the EVDP is 
terminator only or initiator only, and the data contains "<a></a>". The 
data does not conform to EVDP, so it can't be the empty representation, but 
it does conform to the normal representation, so it is not the absent 
representation. The implication is that an empty string would be added to 
the infoset. I'm pretty sure that IBM DFDL will not add anything to the 
infoset for this case, but I will test it to be sure.
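
To make the order of checks concrete, here is a minimal sketch in Python. 
It is an illustration only, not code from IBM DFDL or Daffodil: the function 
name and its three parameters are assumptions, standing in for facts a real 
parser would derive from the schema and the data stream.

```python
def classify_zero_length(es_is_nil_literal: bool,
                         conforms_to_evdp: bool,
                         simple_type: str) -> str:
    """Return which representation a zero-length simple element has.

    All three parameters are hypothetical inputs that a real parser
    would derive from the schema and the data stream.
    """
    # 1. nil representation, only if %ES; is one of the nil literals.
    if es_is_nil_literal:
        return "nil"
    # 2. empty representation, only if the delimiters found in the data
    #    agree with dfdl:emptyValueDelimiterPolicy (EVDP).
    if conforms_to_evdp:
        return "empty"
    # 3. normal representation, for xs:string or xs:hexBinary only.
    if simple_type in ("xs:string", "xs:hexBinary"):
        return "normal"
    # 4. absent representation, when nothing above applies.
    return "absent"


# The "<a></a>" case: %ES; is not a nil literal and EVDP is not satisfied.
print(classify_zero_length(False, False, "xs:string"))  # normal
print(classify_zero_length(False, False, "xs:int"))     # absent
```

Under these assumptions the "<a></a>" string case falls through to the 
normal representation, which is why the spec implies an empty string in the 
infoset.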

The proposed name of the property does not capture this subtlety though 
...

dfdl:emptyElementParsePolicy = ( "treatAsAbsent" | "treatAsEmpty" ) 

Using 'zeroLength' instead of 'empty' goes too far the other way, as it 
encompasses the dfdl:nilValue="%ES;" case.

Regards
 
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday 



From:   Steve Hanson/UK/IBM
To:     Mike Beckerle <mbeckerle.dfdl at gmail.com>
Cc:     DFDL-WG <dfdl-wg at ogf.org>
Date:   31/05/2019 14:42
Subject:        Re: [DFDL-WG] Action 306 - IBM DFDL behaviour when parsing 
empty strings


Re-reading this thread from the bottom, I'm not sure the proposal is 
correct. I may have over-simplified it. I'm going to have to do some more 
tests. Specifically around dfdl:nilValue="%ES;". IBM DFDL supports this, 
which means we can't just be treating empty elements as missing all the 
time. 

Regards
 
Steve Hanson




From:   Steve Hanson/UK/IBM
To:     Mike Beckerle <mbeckerle.dfdl at gmail.com>
Cc:     DFDL-WG <dfdl-wg at ogf.org>
Date:   09/05/2019 08:40
Subject:        Re: [DFDL-WG] Action 306 - IBM DFDL behaviour when parsing 
empty strings


It will hopefully be possible for you to apply the new 'treatAsMissing' 
enum to just 2 places in the Daffodil code:

1) Empty rep found for required occurrence -> processing error

2) Empty rep found for optional occurrence -> don't add anything to 
infoset

Maybe 4 places if you have separate paths for simple v complex.

That's an indicator of how conceptually simple this property is, once you 
know the difference between empty & missing. 

I am pretty sure the IBM DFDL behaviour deviation around empty/missing can 
be encapsulated by just this.
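
The two places listed above could look roughly like the following Python 
sketch. Everything in it is hypothetical (the ProcessingError class, the 
on_empty_rep function, and the spec_rules callback that stands in for the 
existing spec behaviour); it is not taken from Daffodil or IBM DFDL.

```python
from typing import Callable, List


class ProcessingError(Exception):
    """Stand-in for a DFDL processing error (hypothetical class)."""


def on_empty_rep(policy: str, required: bool, name: str,
                 infoset: List[str],
                 spec_rules: Callable[[], None]) -> None:
    """Handle an occurrence whose representation was found to be empty."""
    if policy == "treatAsMissing":
        if required:
            # 1) required occurrence -> processing error
            raise ProcessingError(f"required element {name} is missing")
        # 2) optional occurrence -> don't add anything to the infoset
        return
    # "treatAsEmpty": fall through to the existing spec behaviour,
    # which is not modelled here.
    spec_rules()


infoset: List[str] = []
on_empty_rep("treatAsMissing", False, "a", infoset,
             lambda: infoset.append("a"))
print(infoset)  # []
```

With "treatAsMissing" neither branch ever reaches the defaulting logic, 
which matches the point that a default value is never applied under this 
setting.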

Regards
 
Steve Hanson




From:   Steve Hanson/UK/IBM
To:     Mike Beckerle <mbeckerle.dfdl at gmail.com>
Cc:     DFDL-WG <dfdl-wg at ogf.org>
Date:   09/05/2019 08:12
Subject:        Re: [DFDL-WG] Action 306 - IBM DFDL behaviour when parsing 
empty strings


I prefer

dfdl:emptyElementParsePolicy = ( "treatAsMissing" | "treatAsEmpty" ) 

You have to understand the difference between empty and missing in DFDL. 

It has an effect on all types - for example, if you set "treatAsMissing" 
for a required number, it means empty always causes a processing error 
instead of potentially applying a default.

Regards
 
Steve Hanson




From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     Steve Hanson <smh at uk.ibm.com>
Cc:     DFDL-WG <dfdl-wg at ogf.org>
Date:   08/05/2019 19:26
Subject:        Re: [DFDL-WG] Action 306 - IBM DFDL behaviour when parsing 
empty strings



I suggest we stick with the "...Policy" naming convention for new things 
that control modes of behavior.

I'd prefer to avoid the terms empty and missing in the property values and 
go with something that is more explanatory of what difference it makes.

E.g., emptyElementParsePolicy with values 
"excludeEmptyStringAndHexBinaryValues" and 
"allowEmptyStringAndHexBinaryValues"

The doc for these values will of course have to be in terms of 
Absent/Missing/Empty, etc., but at least the names give some intuition as 
to what they control without having to understand all of DFDL's nuances 
about the difference between Absent and Missing.

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are 
subject to the OGF Intellectual Property Policy



On Wed, May 8, 2019 at 12:52 PM Steve Hanson <smh at uk.ibm.com> wrote:
Maybe this is better: 

dfdl:parseEmptyAsMissing = yes | no 

Regards
 
Steve Hanson 



From:        Mike Beckerle <mbeckerle.dfdl at gmail.com> 
To:        Steve Hanson <smh at uk.ibm.com> 
Cc:        DFDL-WG <dfdl-wg at ogf.org> 
Date:        08/05/2019 16:48 
Subject:        Re: [DFDL-WG] Action 306 - IBM DFDL behaviour when parsing 
empty strings 



Interesting. Many DFDL schemas I've created have a simpleType definition 
named "nzString", which is a string plus an assertion that it is non-empty. 
That's to achieve exactly the behavior you have in IBM DFDL, because, as 
you say, many formats want this. 

We could rename the suggested property emptyElementParsePolicy to make it 
clear it is only about parsing. 

I like treatAsMissing. Easy to say what it means. 
treatAsEmpty begs the question of what empty elements do, but that's 
already complicated in the spec due to optionals and EVDP, so I'm happy 
with this also. 

...mikeb 





On Tue, May 7, 2019 at 3:48 AM Steve Hanson <smh at uk.ibm.com> wrote: 
Hi Mike 

I think what you have highlighted is that there are formats which require 
that empty elements should not be treated as empty but as missing, which 
is effectively what IBM DFDL is doing (our code was written prior to 
action 140 when there was no distinction between empty & missing). That 
could be achieved with assertions. So maybe we should view the new 
property as a convenience property for such formats, as well as handling 
IBM DFDL's behaviour? 

If so, then can I suggest new names for the enums, which I think makes the 
intent clearer?   

        dfdl:emptyElementPolicy = ( "treatAsMissing" | "treatAsEmpty" ) 

This only applies when parsing, maybe names should reflect that also? 

Further, "treatAsMissing" would imply that a default value was never used 
when parsing, as they are only used when the representation is empty.  I 
think we can do away with the SDE clause for "treatAsMissing". The clause 
is only needed for "treatAsEmpty". 

IBM DFDL does implement nillable processing, including use of ES as nil 
literal value.   

Regards
 
Steve Hanson 



From:        Mike Beckerle <mbeckerle.dfdl at gmail.com> 
To:        Steve Hanson <smh at uk.ibm.com> 
Cc:        DFDL-WG <dfdl-wg at ogf.org> 
Date:        03/05/2019 21:23 
Subject:        Re: [DFDL-WG] Action 306 - IBM DFDL behaviour when parsing 
empty strings 




Under testing with the EDIFACT schema (from DFDLSchemas on github) against 
new code in Daffodil, I see that my proposal was not sufficient. 
Steve Hanson stated that IBM DFDL current behavior for required empty 
strings includes "An empty occurrence with no default gives a Processing 
Error." 

I misinterpreted this. I was thinking of a required occurrence of an array 
element (as in, with index <= minOccurs). But it should not be interpreted 
that narrowly: it applies to any required occurrence at all, including 
scalar elements. The EDIFACT schema depends on this behavior, and on the 
backtracking driven by it, in order to work. 

So my suggestion for new properties to control this is revised to: 

dfdl:emptyElementPolicy enum with values 

noEmptyElements  - matches current IBM DFDL behavior where 
* required elements without default values that are empty (specifically 
which satisfy the empty syntax - defined below) always cause Processing 
Errors.  
** If a default value is specified, that is provided as the value instead. 
When a default value is specified, implementations that don't support 
default values when parsing must issue a runtime SDE here, not a 
processing error. 
* optional elements which satisfy the empty syntax are not added to the 
infoset. Defaulting is never considered. 

emptyElements - matches current description in the DFDL spec where 
* required elements: if the string/hexBinary satisfies the empty syntax 
then required elements are created with an empty string or empty hexBinary 
as their value. If a default value is specified, that is substituted as the 
value instead. When a default value is specified, implementations 
that don't support default values when parsing must issue an SDE here, not 
a processing error. 
* optional elements: if the string/hexBinary satisfies the empty syntax 
and emptyValueDelimiterPolicy is not 'none', then an empty string (or 
empty hexBinary) is added to the infoset. If emptyValueDelimiterPolicy is 
'none', nothing is added to the infoset. 

The term "satisfy the empty syntax" means what is found in the data stream 
may require initiator and/or terminator depending on 
emptyValueDelimiterPolicy, but if that is 'none' then this is satisfied 
just by empty string (or no bytes for hexBinary). 
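
As an illustration of "satisfy the empty syntax", here is a small Python 
sketch. It is a simplification under stated assumptions: the element is 
taken to define both an initiator and a terminator, the content between 
them is already known to be zero length, and the names EXPECTED_DELIMITERS 
and satisfies_empty_syntax are invented for the example. The four policy 
values are the dfdl:emptyValueDelimiterPolicy enums.

```python
# Which delimiters each emptyValueDelimiterPolicy value expects to find
# around zero-length content: (initiator, terminator).
EXPECTED_DELIMITERS = {
    "none": (False, False),
    "initiator": (True, False),
    "terminator": (False, True),
    "both": (True, True),
}


def satisfies_empty_syntax(evdp: str, found_initiator: bool,
                           found_terminator: bool) -> bool:
    """True if the delimiters found around zero-length content match EVDP."""
    return (found_initiator, found_terminator) == EXPECTED_DELIMITERS[evdp]


print(satisfies_empty_syntax("none", False, False))     # True
print(satisfies_empty_syntax("both", True, True))       # True
print(satisfies_empty_syntax("initiator", True, True))  # False
```

The real rules have more cases than this (for example elements that define 
only one of the two delimiters), so treat the table as a reading aid rather 
than a statement of the spec.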

Having said the above, I believe we also have to consider nillable 
elements. 

There are two topics: 

1) defaulting to nilled - For a nillable element where the data syntax 
does NOT match the nil representation: anywhere above that a default value 
is specified and has behavior associated with it, if the element is 
nillable and dfdl:useNilAsDefault='true' is specified, then the element 
defaults to being nilled. When nillable and dfdl:useNilAsDefault='true' 
are specified, implementations that don't support defaulting to nilled 
when parsing must issue an SDE here, not a processing error. 

That takes care of the defaulting aspect of nillables. 

The second topic is: 

2) nillable, and dfdl:nilValue contains %ES; as one of the possible nil 
representations. Hence, there is the possibility of empty string (or empty 
hexBinary) matching the nil representation. 

I think the DFDL spec is clear here that if the data stream satisfies the 
nil syntax, then required or optional, you get a nilled element, period. 

Does IBM DFDL implement that behavior?  If so great. If not I think we may 
have to amend the above description of noEmptyElements case for 
dfdl:emptyElementPolicy to specify the special cases. 

...mikeb 




On Sun, Apr 28, 2019 at 9:36 AM Mike Beckerle <mbeckerle.dfdl at gmail.com> 
wrote: 
One clarification: is the IBM DFDL behavior the same for empty hexBinary 
elements as it is for text strings? 

I'm going to suggest we need a policy property e.g., 

dfdl:emptyElementPolicy which is an enum with at least these options: 

noOptionalEmptyElements  - matches current IBM DFDL behavior 
optionalEmptyElementsWithSyntax - matches current description in the DFDL 
spec where initiator and/or terminator found triggers creation of an empty 
string value. (Daffodil implements this.) 

This would apply (I think) to both types xs:string and xs:hexBinary. 

I'm open to suggestions for better naming for the property and the 
property values, but these are the two settings we need I think. 

I do believe that the latter optionalEmptyElementsWithSyntax behavior is 
what the DFDL spec describes, and is most consistent given the available 
properties such as emptyValueDelimiterPolicy. 

We can make implementation of optionalEmptyElementsWithSyntax a DFDL 
optional language feature, thereby avoiding issues of conformance with the 
DFDL standard. 





On Fri, Apr 5, 2019 at 12:43 PM Steve Hanson <smh at uk.ibm.com> wrote: 
Daffodil to perform identical tests but the belief is that they implement 
the spec as published (except maybe for one bug with default values for 
strings). 

So there is a mis-match between Daffodil and IBM DFDL.  It sounds like a 
new property is going to be needed which toggles the way that empty 
strings are handled. 

Regards
 
Steve Hanson 



From:        Steve Hanson/UK/IBM 
To:        DFDL-WG <dfdl-wg at ogf.org> 
Cc:        "Mike Beckerle" <mbeckerle at tresys.com>, 
"Michele Zundo" <michele.zundo at esa.int>, 
Bradd Kadlecik/Poughkeepsie/IBM at IBMUS 
Date:        03/04/2019 12:04 
Subject:        Action 306 - IBM DFDL behaviour when parsing empty strings 


306
Confirm IBM DFDL behaviour when parsing empty strings (Steve) 
7/8: IBM DFDL has not fully implemented the behaviour changes arising from 
action 140 with respect to empty string elements. Daffodil is about to do 
so. IBM DFDL users have complained about lack of defaults when parsing but 
other than that appear happy. Are the rules in the spec for empty strings 
over complicated?  Steve to document the behaviour for IBM DFDL to inform 
the discussion. 
... 
1/11: In progress - there are a lot of subtle scenarios 
15/11: Not discussed 
... 
7/2/19: No further progress




Some progress :) 

9.4.2.2 Simple element (xs:string or xs:hexBinary) 
Required occurrence: If the element has a default value then an item is 
added to the infoset using the default value, otherwise an item is added 
to the Infoset using empty string (type xs:string) or empty hexBinary 
(type xs:hexBinary) as the value. 
Optional occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none' then 
an item is added to the Infoset using empty string (type xs:string) or 
empty hexBinary (type xs:hexBinary) as the value, otherwise nothing is 
added to the Infoset. 

IBM DFDL behaviour: 

Required. IBM DFDL does not implement default values when parsing, so an 
empty occurrence with a default value gives an SDE (to prevent 
backtracking). An empty occurrence with no default gives a Processing 
Error. If you need to add an empty string to the infoset, you can add 
default="" (once default values are implemented, of course). 

Optional. IBM DFDL adds nothing to the infoset regardless of presence of 
initiator and/or terminator. There is no way to get an empty string into 
the infoset. 
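
The difference between the quoted 9.4.2.2 text and the IBM DFDL behaviour 
for an empty xs:string or xs:hexBinary occurrence can be tabulated with a 
short Python sketch. The function names and the outcome labels are made up 
for illustration, and "empty value" means an empty string or empty 
hexBinary item in the infoset.

```python
def spec_outcome(required: bool, has_default: bool, evdp: str) -> str:
    """Outcome per the quoted 9.4.2.2 text for an empty string/hexBinary."""
    if required:
        return "default value" if has_default else "empty value"
    # Optional occurrence: depends only on emptyValueDelimiterPolicy.
    return "empty value" if evdp != "none" else "nothing added"


def ibm_outcome(required: bool, has_default: bool) -> str:
    """Outcome per the IBM DFDL behaviour described above."""
    if required:
        # Defaults are not implemented when parsing, hence the SDE.
        return "SDE" if has_default else "processing error"
    # Optional occurrence: nothing is added, whatever delimiters are found.
    return "nothing added"


# Print every combination: required | has_default | evdp | spec | IBM DFDL
for required in (True, False):
    for has_default in (True, False):
        for evdp in ("none", "both"):
            print(required, has_default, evdp,
                  spec_outcome(required, has_default, evdp),
                  ibm_outcome(required, has_default), sep=" | ")
```

The rows where the two columns disagree are the cases a new property would 
have to switch between.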

9.4.2.3 Complex element 
Required occurrence: An item is added to the Infoset. 
Optional occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none' then 
an item is added to the Infoset, otherwise nothing is added to the 
Infoset. 
For both required and optional occurrences, the Infoset item may also have 
a child item. 
 1. If the first child element of the complex type is a required 
simple element, then an empty string (type xs:string), empty hexBinary 
(type xs:hexBinary), or default value will also be added to the Infoset. 
 2. If the first child element of the complex type is a required 
complex element, then an item is added to the Infoset (which may itself 
have a child via (1)) 

IBM DFDL behaviour: 

Required. IBM DFDL follows the spec (except for item 1 in the cases where 
an error would have been thrown, as per its 9.4.2.2 behaviour). 

Optional. IBM DFDL follows the spec (except for item 1 in the cases where 
an error would have been thrown, as per its 9.4.2.2 behaviour). 


So ... 

The spec today is consistent in one way, in that for both complex & string 
elements a) a required empty occurrence always adds to the infoset; & b) 
an optional empty occurrence adds to the infoset if initiator/terminator 
present; & c) an optional empty occurrence does not add to the infoset if 
no initiator/terminator present. 

If the simple string behaviour were to change to match IBM DFDL then that 
consistency is lost, but the string behaviour then matches that for other 
simple types. Section 9.4.2.2 disappears as the behaviour is the same as 
9.4.2.1. Section 9.4.2.3 becomes as below. We lose the ability to get an 
empty string into the infoset for an optional string with 
initiator/terminator. 

9.4.2.3 Complex element 
Required occurrence: An item is added to the Infoset. 
Optional occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none' then 
an item is added to the Infoset, otherwise nothing is added to the 
Infoset. 
For both required and optional occurrences, the Infoset item may also have 
a child item. 
 1. If the first child element of the complex type is a required 
simple element, then a default value will also be added to the Infoset. 
 2. If the first child element of the complex type is a required 
complex element, then an item is added to the Infoset (which may itself 
have a child via (1)) 

We also need to be sure that any other implementations have not yet 
implemented the current spec behaviour.  Need to check with DFDL4S and IBM 
TPF. 

To be discussed on next WG call ... 

Regards
 
Steve Hanson 
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


--
  dfdl-wg mailing list
  dfdl-wg at ogf.org
  https://www.ogf.org/mailman/listinfo/dfdl-wg 
