[DFDL-WG] Problem: simple format that is impossible to model

Tue Oct 1 10:14:40 EDT 2019

OK so I think the motivating example can be described as follows:

1) CSV style format
2) Only delimiters are separators
3) There are optional fields that occur beyond the last required field *
4) Empty string is considered a normal value that needs preserving for 
such an optional field
5) Nil value is already being used for something else **

* Otherwise you just make all fields required and use a default value of 
empty string
** Otherwise you make the fields nillable with a nil value of empty string.

IBM DFDL has been operating in a world of CSV and other delimited formats 
for nearly 8 years, and I've not come across this requirement in reality. 
There is usually no distinction between an omitted value and empty string 
in CSV style formats where the field is optional.

I would prefer that this was deferred until DFDL 2.0. Meanwhile we can 
design the proposed new dfdlx:emptyElementParsePolicy so it can be easily 
extended.

Regards

Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday 

From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     Steve Hanson <smh at uk.ibm.com>
Cc:     DFDL-WG <dfdl-wg at ogf.org>
Date:   27/09/2019 19:20
Subject:        Re: [DFDL-WG] Problem: simple format that is impossible to 
model

Yes there is the nillable technique, but my simplified example data format 
was too simplified.

In the real format that I derived my simple example from, nillability is 
used for other purposes. In that format generally elements are nillable 
with nilValue="%WSP*;-%WSP*;".  That is, the format needs to distinguish 
explicitly nilled values from string values, including empty string 
values. 
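
For illustration, the nil literal above reads as "optional whitespace, a 
dash, optional whitespace". A minimal Python sketch of that distinction, 
assuming Python's \s is a close enough stand-in for the DFDL %WSP; 
character class (the exact character sets differ); the names NIL_PATTERN 
and classify are hypothetical:

```python
import re

# Approximation of nilValue="%WSP*;-%WSP*;": zero or more whitespace
# characters, a literal dash, zero or more whitespace characters.
# ASSUMPTION: \s only approximates the DFDL %WSP; character set.
NIL_PATTERN = re.compile(r"\s*-\s*")


def classify(raw):
    """Return 'nil' for the dash literal, otherwise 'value'.

    An empty string is deliberately classified as a value, because the
    format needs to keep it distinct from an explicitly nilled field.
    """
    return "nil" if NIL_PATTERN.fullmatch(raw) else "value"
```

With this reading, " - " is nil while both "" and "data" are ordinary 
string values, which is why the nil mechanism is not free to represent 
the empty string.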

I know I'm not the only user who thinks one should be able to model this 
simple data format without needing to use nillability. 

For example, if you look at the CSV schema on DFDLSchemas on github, the 
elements for rows of data are not nillable, even though adjacent commas 
are routine in CSV files. 
Of course a CSV schema for a fixed-number-of-columns doesn't have a 
variable number of elements in the rows, so the elements in the rows are 
all required, not optional.  Still I think you can't tolerate adjacent 
commas without using nillability if you want the data to both parse and 
unparse.  I have wanted to enhance the CSV schema on github to show more 
variations on the CSV-like theme for a while. I have recently created 
many CSV-like data schemas, and a common theme is that they use a variety 
of representations for nil, such as "N/A none -" (these were 
human-created spreadsheet 'documents' exported as CSV, not 
machine-generated CSV data sets), and in some of them empty strings are 
legitimate "normal" values. I was fortunate that these formats were 
parse-only, as they would not have unparsed faithfully. 

The problem ultimately boils down to this: there is no way in DFDL to say 
"treat empty strings as just normal strings". The use of 
initiators/terminators and dfdl:emptyValueDelimiterPolicy="both" doesn't 
fix this, because that doesn't give you a NormalRep, it gives you an 
EmptyRep. 

There is also an ambiguity in the spec between section 9.2.5 and sections 
9.2.3 - 9.2.4, as to whether a zero-length string/hexBinary with no 
framing is NormalRep or EmptyRep. 

The fact that we have a property named dfdl:emptyValueDelimiterPolicy 
suggests that an element, regardless of type, is EmptyRep if the content 
is zero length and the initiators/terminators match the EVDP policy.
That suggests that section 9.2.5 is simply incorrect - a NormalRep cannot 
be zero-length for string or hexBinary if there is no framing. Such an 
element would always be an EmptyRep.
That would leave the nillable mechanism as the only way to deal with 
zero-length strings that need to be retained in the infoset. 

While it is good to fix that ambiguity, I find this not really an adequate 
solution. I can't deal with my slash-delimited format that uses nillable 
for other purposes in any reasonable way. I need a way to say "treat 
zero-length strings as normal values". 

I suggest we modify the recently proposed dfdlx:emptyElementParsePolicy 
property to encompass the added variation we need. So the values of the 
property would be:

1) Treat zero-length for all types as AbsentRep always. (We were calling 
   this "treatAsMissing" or "treatAsAbsent". This is the IBM DFDL 
   behavior today, as I understand it.)
2) Treat zero-length for all types as EmptyRep always. (We were calling 
   this "treatAsEmpty". This is the DFDL Spec behavior as written today, 
   as revised by current errata and with the correction mentioned above 
   to remove the ambiguity.)
3) Treat zero-length for string/hexBinary as NormalRep, and for all other 
   types as EmptyRep. (Suggest "treatAsNormalOrEmpty". The rationale for 
   this enum name is that, since types other than string/hexBinary can't 
   have a zero-length NormalRep, they must be EmptyRep. I read the enum 
   name as "treat as NormalRep when possible, otherwise treat as 
   EmptyRep". Another possible enum name might be "preferNormalToEmpty".)

Section 9.2.5 would be clarified to say that zero-length NormalRep is 
possible for string/hexBinary if there is no framing and 
dfdlx:emptyElementParsePolicy is 'treatAsNormalOrEmpty'. 
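
The three proposed values can be summarized as a small decision function. 
This is only a sketch of the proposal as described in this message, not 
of any implementation; the enum spelling "treatAsAbsent" is one of the 
two candidate names mentioned, and Rep and zero_length_rep are 
hypothetical names:

```python
from enum import Enum


class Rep(Enum):
    ABSENT = "AbsentRep"
    EMPTY = "EmptyRep"
    NORMAL = "NormalRep"


# Only these simple types can have a zero-length normal representation.
ZERO_LENGTH_CAPABLE = {"string", "hexBinary"}


def zero_length_rep(policy, simple_type):
    """Classify a zero-length, unframed occurrence under the proposed policy."""
    if policy == "treatAsAbsent":
        return Rep.ABSENT
    if policy == "treatAsEmpty":
        return Rep.EMPTY
    if policy == "treatAsNormalOrEmpty":
        # NormalRep when the type allows it, otherwise fall back to EmptyRep.
        if simple_type in ZERO_LENGTH_CAPABLE:
            return Rep.NORMAL
        return Rep.EMPTY
    raise ValueError(f"unknown policy: {policy}")
```

Under "treatAsNormalOrEmpty" a zero-length xs:string becomes an ordinary 
infoset value, while a zero-length xs:int is still EmptyRep.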

Sections 14.2.2 and 14.2.3 may need a one-line clarification added that 
when zero-length string/hexBinary is being treated as NormalRep, then they 
are "normal" not "empty", and since they are not EmptyRep suppression of 
zero-length and separators would not occur for trailingEmpty, 
trailingEmptyStrict, or anyEmpty. (Which should be intuitive given the 
enum names use the word "Empty")
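
A toy model of that clarification, assuming each field carries a tag 
saying whether it is NormalRep or EmptyRep. It covers only the 
trailingEmpty case on unparse and ignores everything else in sections 
14.2.2 and 14.2.3; unparse_with_suppression is a hypothetical name:

```python
def unparse_with_suppression(fields, separator="/", policy="trailingEmpty"):
    """Join fields, dropping trailing EmptyRep occurrences and their separators.

    fields is a list of (text, rep) pairs, where rep is "normal" or "empty".
    Zero-length fields tagged "normal" are never suppressed.
    """
    kept = list(fields)
    if policy == "trailingEmpty":
        while kept and kept[-1][1] == "empty":
            kept.pop()
    return separator.join(text for text, _ in kept)
```

A row whose trailing zero-length fields are EmptyRep unparses to "data", 
while the same row with those fields treated as NormalRep unparses to 
"data//".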

It would/could be an SDE (or maybe warning) if this latter 
"treatAsNormalOrEmpty" was specified for a potentially required element 
(scalar or minOccurs > 0) of type string or hexBinary of variable length 
(so possibly zero) with a default specified other than default="", because 
such a default value could never be used, as zero length would be 
considered NormalRep and so would not trigger use of the default value. 
I.e., SDE like "Default value for element X can never be used because...."
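
The check described above could look roughly like the following. This is 
a sketch under the assumption that a schema compiler already knows these 
five facts about the element; the function name and parameters are 
hypothetical, and the diagnostic text stops where the message above is 
elided:

```python
def unusable_default_message(name, simple_type, policy, required,
                             variable_length, default):
    """Return a diagnostic if the default can never be applied, else None."""
    if policy != "treatAsNormalOrEmpty":
        return None
    if simple_type not in ("string", "hexBinary"):
        return None
    # Only potentially required, possibly zero-length elements are affected.
    if not (required and variable_length):
        return None
    # No default, or default="", is harmless.
    if default is None or default == "":
        return None
    return f"Default value for element {name} can never be used"
```

Whether the result is reported as an SDE or as a warning is left open, as 
in the text above.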

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are 
subject to the OGF Intellectual Property Policy

On Fri, Sep 27, 2019 at 4:39 AM Steve Hanson <smh at uk.ibm.com> wrote:
As there are no initiators or terminators, and your example infoset calls 
everything 'field', I am assuming that the element looks logically like: 

<xs:element name="field" type="xs:string" minOccurs="0" 
maxOccurs="unbounded" /> 

You want to preserve the position of the occurrences in the infoset so 
that they re-appear on output. The agreed way to do this is: 

<xs:element name="field" type="xs:string" minOccurs="0" 
maxOccurs="unbounded" nillable="true" dfdl:nilKind="literalValue" 
dfdl:nilValue="%ES;" /> 
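
In effect, the nillable technique maps each zero-length field to a nil 
item rather than to an empty string. A minimal Python model of that 
behavior, assuming "/" as the separator; NIL and both function names are 
hypothetical:

```python
# Sentinel standing in for a nilled infoset item (xsi:nil="true").
NIL = object()


def parse_row_nillable(line, separator="/"):
    """Zero-length fields become NIL, as with dfdl:nilValue="%ES;"."""
    return [NIL if text == "" else text for text in line.split(separator)]


def unparse_row_nillable(fields, separator="/"):
    """NIL items are written back out as zero-length fields."""
    return separator.join("" if f is NIL else f for f in fields)
```

Positions are preserved and "data//data" round trips, but the middle item 
is nil, not an empty string value.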

Regards

Steve Hanson 

From:        Mike Beckerle <mbeckerle.dfdl at gmail.com> 
To:        DFDL-WG <dfdl-wg at ogf.org> 
Date:        26/09/2019 19:11 
Subject:        Re: [DFDL-WG] Problem: simple format that is impossible to 
model 
Sent by:        "dfdl-wg" <dfdl-wg-bounces at ogf.org> 

To start discussion on my own issue..... 

The problem here may be that for a string (or hexBinary), if there is no 
initiator/terminator, there is no way to distinguish EmptyRep from 
NormalRep. I.e., an empty string is a "normal" value for a string. 

Sections 9.2.3 and 9.2.4 seem to define EmptyRep and NormalRep such that 
an empty string will be an EmptyRep, not a NormalRep. 

However section 9.2.5 on zero-length says: 

   "The normal representation can be a zero-length representation if the 
type is xs:string or xs:hexBinary and there is no framing." 

That suggests that when there is no framing, a zero-length string is 
NormalRep, not EmptyRep, which is the opposite conclusion from what is in 
sections 9.2.3 and 9.2.4. 

If this latter clarification is correct, then my format *should* work as I 
expect, because the empty string elements will be considered NormalRep and 
infoset values will be created for them. 
In that case it fails only because of a bug in Daffodil, which has not 
interpreted this correctly. 

...mikeb 

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com 

On Thu, Sep 26, 2019 at 1:47 PM Mike Beckerle <mbeckerle.dfdl at gmail.com> 
wrote: 
I have a dead-simple little format: 

    data/data/data/data 
    data/data/data/data 

it is lines of "/" separated strings. All elements are optional. 

I simply want this: 

   data//data 

to round trip. For that to happen I need it to parse into 

   <field>data</field><field></field><field>data</field> 

That is, I require that empty field element in the middle to be created 
and put into the infoset. 

I can find no way to do this. 
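
The requirement can be stated as a round-trip property. A minimal Python 
sketch of the desired behavior, with lists of strings standing in for the 
infoset; the function names are hypothetical and this models only the 
goal, not how a DFDL processor works:

```python
def parse_row(line, separator="/"):
    """Every position between separators becomes a field, even if empty."""
    return line.split(separator)


def unparse_row(fields, separator="/"):
    """Write the fields back out with an infix separator."""
    return separator.join(fields)


def drop_empty(fields):
    """What effectively happens when zero-length fields are not added."""
    return [f for f in fields if f != ""]
```

If the empty field is kept, unparse_row(parse_row("data//data")) gives 
back "data//data". If it is dropped, the result is "data/data" and the 
position of the second value is lost.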

The strings have no initiator/terminator, so 
dfdl:emptyValueDelimiterPolicy is not relevant. All the elements are 
optional, so default values aren't relevant. 

The spec states: 

9.4.2.2 Simple element (xs:string or xs:hexBinary) 

Required occurrence: If the element has a default value then an item is 
added to the infoset using the default value, otherwise an item is added 
to the Infoset using empty string (type xs:string) or empty hexBinary 
(type xs:hexBinary) as the value. 

Optional occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none' then 
an item is added to the Infoset using empty string (type xs:string) or 
empty hexBinary (type xs:hexBinary) as the value, otherwise nothing is 
added to the Infoset. 
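
Read literally, the quoted rule amounts to the decision below. This is a 
sketch of the wording as quoted, before any errata; it applies only once 
an occurrence has already been classified as the empty representation, 
and the function name and parameters are hypothetical:

```python
def infoset_item_for_empty(required, default=None, evdp="none",
                           simple_type="string"):
    """Return the value added to the infoset, or None if nothing is added.

    evdp stands for the value of dfdl:emptyValueDelimiterPolicy.
    """
    empty = "" if simple_type == "string" else b""
    if required:
        # Required occurrence: default if there is one, else empty value.
        return default if default is not None else empty
    if evdp != "none":
        # Optional occurrence with delimiters marking the empty value.
        return empty
    # Optional occurrence, no delimiters: nothing is added.
    return None
```

The slash-separated format above always lands in the last branch: the 
fields are optional and have no initiator or terminator, so no item is 
created for the empty field.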

There are errata/actions to clarify wording here around 
dfdl:emptyValueDelimiterPolicy being in effect or not (because there is no 
initiator/terminator for it to use as opposed to the property in isolation 
just being 'none'). 
But that doesn't change anything about this issue. 

If this very simple format is not possible, then we need a property or new 
property enum value that makes it possible. 

Thoughts? 

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com 
--
 dfdl-wg mailing list
 dfdl-wg at ogf.org
 https://www.ogf.org/mailman/listinfo/dfdl-wg 

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
