[DFDL-WG] Problem: simple format that is impossible to model

Mike Beckerle mbeckerle.dfdl at gmail.com
Tue Oct 1 18:36:36 EDT 2019


To be clear, the example is real. It comes from a format called USMTF,
which is US MIL-STD-6040, NATO STANAG 5500.

I am OK with leaving this until DFDL 2.0 and doing experimental
implementations in the meantime.

We then still have a bug in DFDL spec section 9.2.5, where it suggests that
the normal rep for string/hexBinary can be zero-length if there is no
framing. I believe this is simply false. A zero-length rep for a string or
hexBinary has to be empty rep or nil rep.

-mike beckerle


On Tue, Oct 1, 2019 at 10:14 AM Steve Hanson <smh at uk.ibm.com> wrote:

> OK so I think the motivating example can be described as follows:
>
> 1) CSV style format
> 2) Only delimiters are separators
> 3) There are optional fields that occur beyond the last required field *
> 4) Empty string is considered a normal value that needs preserving for
> such an optional field
> 5) Nil value is already being used for something else **
>
> * Otherwise you just make all fields required and use a default value of
> empty string
> ** Otherwise you use a nil default value of empty string.
>
> IBM DFDL has been operating in a world of CSV and other delimited formats
> for nearly 8 years, and I've not come across this requirement in reality.
> There is usually no distinction between an omitted value and empty string
> in CSV style formats where the field is optional.
>
> I would prefer that this was deferred until DFDL 2.0. Meanwhile we can
> design the proposed new dfdlx:emptyElementParsePolicy so it can be easily
> extended.
>
> Regards
>
> Steve Hanson
>
> IBM Hybrid Integration, Hursley, UK
> Architect, *IBM DFDL*
> <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
> Co-Chair, *OGF DFDL Working Group* <http://www.ogf.org/dfdl/>
> *smh at uk.ibm.com* <smh at uk.ibm.com>
> tel:+44-1962-815848
> mob:+44-7717-378890
> Note: I work Tuesday to Friday
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        Steve Hanson <smh at uk.ibm.com>
> Cc:        DFDL-WG <dfdl-wg at ogf.org>
> Date:        27/09/2019 19:20
> Subject:        Re: [DFDL-WG] Problem: simple format that is impossible
> to model
> ------------------------------
>
>
>
> Yes there is the nillable technique, but my simplified example data format
> was too simplified.
>
> In the real format that I derived my simple example from, nillability is
> used for other purposes. In that format, elements are generally nillable
> with nilValue="%WSP*;-%WSP*;". That is, the format needs to distinguish
> explicitly nilled values from string values, including empty string values.
>
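As a rough illustration of that distinction, here is a minimal Python
sketch. It is a toy model, not DFDL: `field_value` is a hypothetical helper,
and the regular expression only approximates nilValue="%WSP*;-%WSP*;" by
using Python's `\s` for whitespace.

```python
import re

# Approximation (an assumption) of nilValue="%WSP*;-%WSP*;":
# optional whitespace, a single "-", optional whitespace.
NIL_PATTERN = re.compile(r"\s*-\s*")


def field_value(text):
    """Map raw field text to None (nilled) or a string value."""
    if NIL_PATTERN.fullmatch(text):
        return None  # explicitly nilled
    return text      # normal string value, possibly empty


for raw in ["data", " - ", "-", ""]:
    print(repr(raw), "->", repr(field_value(raw)))
```

The point of the sketch is that zero-length text does not match the nil
pattern, so it has to survive as an ordinary empty string.
>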
> I know I'm not the only user who thinks one should be able to model this
> simple data format without needing to use nillability.
>
> For example, if you look at the CSV schema on DFDLSchemas on github, the
> elements for rows of data are not nillable, even though adjacent commas are
> routine in CSV files.
> Of course a CSV schema for a fixed-number-of-columns doesn't have a
> variable number of elements in the rows, so the elements in the rows are
> all required, not optional.  Still I think you can't tolerate adjacent
> commas without using nillability if you want the data to both parse and
> unparse.  I have wanted to enhance the CSV schema on github to show more
> variations on the CSV-like theme for a while, because I have recently
> created many CSV-like data schemas, and a common theme to them is that
> there are a variety of representations of nilled such as "N/A none -"
> (these were human-created spreadsheet 'documents' exported as CSV, not
> machine-generated CSV data sets),  and in some of these empty strings are
> legit "normal" values. I have had the good fortune that these formats were
> parse-only, as they would not have faithfully unparsed.
>
> The problem ultimately boils down to this: there is no way in DFDL to say
> "treat empty strings as just normal strings". The use of
> initiators/terminators and dfdl:emptyValueDelimiterPolicy="both" doesn't
> fix this, because that doesn't give you a NormalRep, it gives you EmptyRep.
>
> There is also ambiguity in the spec between section 9.2.5 and sections
> 9.2.3 - 9.2.4 as to whether a zero-length string/hexBinary with no framing
> is NormalRep or EmptyRep.
>
> The fact that we have a property named dfdl:emptyValueDelimiterPolicy
> suggests that an element, regardless of type, is EmptyRep if the content is
> zero length and the initiators/terminators match the
> dfdl:emptyValueDelimiterPolicy setting.
> That suggests that section 9.2.5 is simply incorrect - a NormalRep cannot
> be zero-length for string or hexBinary if there is no framing. Such would
> always be an EmptyRep.
> That would leave the nillable mechanism as the only way to deal with
> zero-length strings that need to be retained in the infoset.
>
> While it is good to fix that ambiguity, I find this not really an adequate
> solution. I can't deal with my slash-delimited format that uses nillable
> for other purposes in any reasonable way. I need a way to say "treat
> zero-length strings as normal values".
>
> I suggest we modify the recently proposed dfdlx:emptyElementParsePolicy
> property to encompass the added variation we need. So the values of the
> property would be:
>
>    - treat zero-length for all types as AbsentRep always (we were calling
>    this "treatAsMissing", or "treatAsAbsent" - this is the IBM DFDL behavior
>    today as I understand it.)
>    - treat zero-length for all types as EmptyRep always (we were calling
>    this "treatAsEmpty" - this is the DFDL Spec behavior as written today as
>    revised by current errata and with the correction mentioned above to remove
>    the ambiguity.)
>    - treat zero-length for string/hexBinary as NormalRep, all other types
>    as EmptyRep (Suggest "treatAsNormalOrEmpty". The rationale for this enum
>    name is that types other than string/hexBinary can't have a zero-length
>    NormalRep, so they must be EmptyRep. I read the enum name as "treat as
>    NormalRep when possible, otherwise treat as EmptyRep". Another possible
>    enum name might be "preferNormalToEmpty".)
>
> Section 9.2.5 would be clarified to say that zero-length NormalRep is
> possible for string/hexBinary if there is no framing and
> dfdlx:emptyElementParsePolicy is 'treatAsNormalOrEmpty'.
>
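The three proposed values can be sketched as a small classification
function. This is a Python toy model, not part of any DFDL implementation:
`classify_zero_length` is a hypothetical name, "treatAsAbsent" is picked
from the two names floated for the first value, and framing and nil
processing are ignored.

```python
# Toy model of the three proposed dfdlx:emptyElementParsePolicy values.
# classify_zero_length is a hypothetical helper; it only covers
# zero-length content and ignores framing and nil processing.

ZERO_LENGTH_NORMAL_TYPES = {"xs:string", "xs:hexBinary"}


def classify_zero_length(policy, xs_type):
    """Return the representation assigned to zero-length content."""
    if policy == "treatAsAbsent":
        return "AbsentRep"
    if policy == "treatAsEmpty":
        return "EmptyRep"
    if policy == "treatAsNormalOrEmpty":
        if xs_type in ZERO_LENGTH_NORMAL_TYPES:
            return "NormalRep"
        return "EmptyRep"
    raise ValueError(f"unknown policy: {policy}")


for policy in ("treatAsAbsent", "treatAsEmpty", "treatAsNormalOrEmpty"):
    for xs_type in ("xs:string", "xs:int"):
        print(policy, xs_type, classify_zero_length(policy, xs_type))
```

Only the third value ever yields NormalRep, and only for string/hexBinary,
which is what the suggested enum name is meant to convey.
>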
> Sections 14.2.2 and 14.2.3 may need a one-line clarification added that
> when zero-length string/hexBinary is being treated as NormalRep, then they
> are "normal" not "empty", and since they are not EmptyRep suppression of
> zero-length and separators would not occur for trailingEmpty,
> trailingEmptyStrict, or anyEmpty. (This should be intuitive, given that the
> enum names use the word "Empty".)
>
> It would/could be an SDE (or maybe warning) if this latter
> "treatAsNormalOrEmpty" was specified for a potentially required element
> (scalar or minOccurs > 0) of type string or hexBinary of variable length
> (so possibly zero) with a default specified other than default="", because
> such a default value could never be used, as zero length would be
> considered NormalRep and so would not trigger use of the default value.
> I.e., SDE like "Default value for element X can never be used because...."
>
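A check along those lines might look like the following Python sketch. The
dictionary keys and the function name `default_never_used` are hypothetical,
chosen only for illustration; a real implementation would work on the
compiled schema.

```python
# Hypothetical static check modelled on the suggested SDE.
# The dictionary keys are illustrative, not a real schema API.

ZERO_LENGTH_NORMAL_TYPES = ("xs:string", "xs:hexBinary")


def default_never_used(elem):
    """True when the element's default value could never be used."""
    default = elem.get("default")
    return (
        elem["policy"] == "treatAsNormalOrEmpty"
        and elem["type"] in ZERO_LENGTH_NORMAL_TYPES
        and elem["min_occurs"] > 0       # potentially required
        and elem["variable_length"]      # so possibly zero length
        and default is not None
        and default != ""
    )


elem = {
    "name": "X",
    "type": "xs:string",
    "min_occurs": 1,
    "variable_length": True,
    "policy": "treatAsNormalOrEmpty",
    "default": "n/a",
}
if default_never_used(elem):
    print(f"SDE: default value for element {elem['name']} can never be used")
```
>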
> Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
> *www.tresys.com* <http://www.tresys.com>
> Please note: Contributions to the DFDL Workgroup's email discussions are
> subject to the *OGF Intellectual Property Policy*
> <http://www.ogf.org/About/abt_policies.php>
>
>
>
> On Fri, Sep 27, 2019 at 4:39 AM Steve Hanson <smh at uk.ibm.com> wrote:
> As there are no initiators or terminators, and your example infoset calls
> everything 'field', I am assuming that the element looks logically like:
>
> <xs:element name="field" type="xs:string" minOccurs="0"
> maxOccurs="unbounded" />
>
> You want to preserve the position of the occurrences in the infoset so
> that they re-appear on output. The agreed way to do this is:
>
> <xs:element name="field" type="xs:string" minOccurs="0"
> maxOccurs="unbounded" nillable="true" dfdl:nilKind="literalValue"
> dfdl:nilValue="%ES;" />
>
> Regards
>
> Steve Hanson
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        DFDL-WG <dfdl-wg at ogf.org>
> Date:        26/09/2019 19:11
> Subject:        Re: [DFDL-WG] Problem: simple format that is impossible
> to model
> Sent by:        "dfdl-wg" <dfdl-wg-bounces at ogf.org>
> ------------------------------
>
>
>
>
> To start discussion on my own issue.....
>
> The problem here may be that for a string (or hexBinary), if there is no
> initiator/terminator, there is no way to distinguish EmptyRep from
> NormalRep. I.e., an empty string is a "normal" value for a string.
>
> Sections 9.2.3 and 9.2.4 seem to define EmptyRep and NormalRep such that
> an empty string will be an EmptyRep, not a NormalRep.
>
> However section 9.2.5 on zero-length says:
>
>    "The normal representation can be a zero-length representation if the
> type is xs:string or xs:hexBinary and there is no framing."
>
> That suggests that when there is no framing, a zero-length string is
> NormalRep, not EmptyRep, which is the opposite conclusion from what is in
> sections 9.2.3 and 9.2.4.
>
> If this latter clarification is correct, then my format *should* work as I
> expect, because the empty string elements will be considered NormalRep and
> infoset values will be created for them.
> It simply doesn't work because of a bug in Daffodil, which has not
> interpreted this correctly.
>
> ...mikeb
>
>
>
>
>
>
> On Thu, Sep 26, 2019 at 1:47 PM Mike Beckerle
> <mbeckerle.dfdl at gmail.com> wrote:
> I have a dead-simple little format:
>
>     data/data/data/data
>     data/data/data/data
>
> It is lines of "/"-separated strings. All elements are optional.
>
> I simply want this:
>
>    data//data
>
> to round trip. For that to happen I need it to parse into
>
>    <field>data</field><field></field><field>data</field>
>
> That is, I require that empty field element in the middle to be created
> and put into the infoset.
>
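A minimal sketch of the round-trip requirement, in Python rather than DFDL.
The helpers `parse_line` and `unparse_line` are hypothetical and only model
the separator handling, not a real DFDL processor.

```python
# Toy model of the round-trip requirement; not a DFDL processor.
# parse_line and unparse_line are hypothetical helper names.

def parse_line(line, keep_empty):
    """Split a line on "/" into a list of field values.

    keep_empty=True keeps zero-length fields as empty strings;
    keep_empty=False drops them, as when an optional zero-length
    occurrence adds nothing to the infoset.
    """
    fields = line.split("/")
    if keep_empty:
        return fields
    return [f for f in fields if f != ""]


def unparse_line(fields):
    """Join field values back together with the "/" separator."""
    return "/".join(fields)


original = "data//data"

kept = parse_line(original, keep_empty=True)
dropped = parse_line(original, keep_empty=False)

print(kept, unparse_line(kept) == original)
print(dropped, unparse_line(dropped) == original)
```

When the empty field is dropped, the unparsed line comes back as
"data/data", so the data no longer round trips.
>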
> I can find no way to do this.
>
> The strings have no initiator/terminator, so
> dfdl:emptyValueDelimiterPolicy is not relevant. All the elements are
> optional, so default values aren't relevant.
>
> The spec states:
>
> 9.4.2.2      Simple element (xs:string or xs:hexBinary)
> Required occurrence: If the element has a default value then an item is
> added to the infoset using the default value, otherwise an item is added to
> the Infoset using empty string (type xs:string) or empty hexBinary (type
> xs:hexBinary) as the value.
> Optional occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none' then
> an item is added to the Infoset using empty string (type xs:string) or
> empty hexBinary (type xs:hexBinary) as the value, *otherwise nothing is
> added to the Infoset*.
>
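The rule quoted above can be restated as a small Python sketch for the
xs:string case. `item_for_empty_rep` is a hypothetical name, and returning
None to mean "nothing is added to the Infoset" is a convention of the
sketch, not of DFDL.

```python
# Toy restatement of the quoted 9.4.2.2 text for xs:string only.
# item_for_empty_rep is a hypothetical helper name.

def item_for_empty_rep(required, default=None, evdp="none"):
    """Return the infoset value for an empty representation.

    None means nothing is added to the infoset (a convention of
    this sketch, not of DFDL).
    """
    if required:
        # Required occurrence: the default if there is one, else "".
        return default if default is not None else ""
    # Optional occurrence: depends on emptyValueDelimiterPolicy.
    if evdp != "none":
        return ""
    return None


# The "/"-separated format: optional fields, no initiator/terminator.
print(item_for_empty_rep(required=False, evdp="none"))
```

For the optional, unframed fields of the format above, the sketch returns
None, which corresponds to the missing middle field.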
>
> There are errata/actions to clarify wording here around
> dfdl:emptyValueDelimiterPolicy being in effect or not (because there is no
> initiator/terminator for it to use as opposed to the property in isolation
> just being 'none').
> But that doesn't change anything about this issue.
>
> If this very simple format is not possible, then we need a property or new
> property enum value that makes it possible.
>
> Thoughts?
>
>
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>
>