[DFDL-WG] clarification needed - ambiguity about empty string and optional element
Steve Hanson
smh at uk.ibm.com
Thu Aug 2 12:08:23 EDT 2018
First thing to note is that 'anyEmpty' means the sequence is
non-positional, and in such a sequence I would expect initiators to be
defined.
dfdl:emptyValueDelimiterPolicy is not relevant, as no initiator or terminator is defined.
"Since the 'y' element decl does not specify a XSD default value, the
concept of 'empty' and defaulting doesn't apply here". Not correct. The
concept of empty applies; defaulting happens if empty & required & default
set.
For your "foo;" example, the infoset should not contain <y/> because y is
optional & empty & does not have an initiator (spec 9.4.2.2):
Optional occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none' then
an item is added to the Infoset using empty string (type xs:string) or
empty hexBinary (type xs:hexBinary) as the value, otherwise nothing is
added to the Infoset.
I think that the sentence can be clarified to say:
Optional occurrence: If dfdl:emptyValueDelimiterPolicy is applicable and
not 'none' then an item is added to the Infoset using empty string (type
xs:string) or empty hexBinary (type xs:hexBinary) as the value, otherwise
nothing is added to the Infoset.
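The difference between a literal reading of the current sentence and the proposed sentence can be sketched as a small decision function. This is an illustration only, not text from the spec and not how any DFDL processor is implemented. The function name, its parameters, and the assumption that the policy is "applicable" only when an initiator or terminator is defined are hypothetical.

```python
def optional_empty_adds_item(evdp, has_initiator, has_terminator, clarified=True):
    """Return True if an empty optional xs:string/xs:hexBinary occurrence
    adds an item to the infoset.

    evdp is the dfdl:emptyValueDelimiterPolicy value as a string.
    Assumption for this sketch: the policy is "applicable" only when an
    initiator or terminator is defined on the element.
    """
    not_none = evdp != "none"
    if not clarified:
        # Literal reading of the current 9.4.2.2 wording: only "is not 'none'".
        return not_none
    applicable = has_initiator or has_terminator
    # Proposed wording: "is applicable and not 'none'".
    return applicable and not_none


# The "foo;" example: policy is 'both', but no initiator or terminator.
current = optional_empty_adds_item("both", False, False, clarified=False)
proposed = optional_empty_adds_item("both", False, False, clarified=True)
print(current, proposed)
```

Under these assumptions the two readings disagree only when the policy is set but no initiator or terminator exists, which is exactly the "foo;" case.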
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: dfdl-wg at ogf.org
Date: 01/08/2018 19:42
Subject: Re: [DFDL-WG] clarification needed - ambiguity about empty
string and optional element
Sent by: "dfdl-wg" <dfdl-wg-bounces at ogf.org>
I omitted that dfdl:emptyValueDelimiterPolicy is 'both' here, though
neither dfdl:initiator nor dfdl:terminator is defined.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
On Wed, Jul 11, 2018 at 8:16 AM, Mike Beckerle <mbeckerle.dfdl at gmail.com>
wrote:
Consider this data of 4 characters:
foo;
Consider this schema where the default format is the basic general set of
text-oriented defaults.
<xs:element name="ex_infix" dfdl:lengthKind="implicit">
  <xs:complexType>
    <xs:sequence dfdl:separator=";"
                 dfdl:separatorSuppressionPolicy="anyEmpty"
                 dfdl:separatorPosition="infix">
      <xs:element name="x" type="xs:string" dfdl:lengthKind="delimited"/>
      <xs:element name="y" type="xs:string" minOccurs="0"
                  dfdl:lengthKind="delimited"
                  dfdl:occursCountKind="implicit"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
This is in a current Daffodil unit test, and produces this infoset:
<ex_infix><x>foo</x><y/></ex_infix>
That is, an empty string element is created for element 'y'.
I'd like to know what IBM DFDL produces as the infoset for this example.
I believe the DFDL spec is actually self-contradictory and so ambiguous
here about what is the right behavior.
DFDL Spec 14.2.1 description of anyEmpty: "...any occurrences that have
zero length representation MAY be omitted from the data, along with their
associated separator."
Note that it says "may", not "must be". So anyEmpty is "lax": it does not
insist that the zero-length elements be absent from the data.
This doesn't clarify anything for us. But it admits the possibility that
the ";" separator appears even if the 'y' element occurrence is determined
to not exist.
DFDL Spec 9.3.1.1 says an element is known to exist if it has the nil,
empty, or normal representation.
In the example, element 'y' is zero-length which is either empty or normal
representation since a string can have "" (empty string) as a value.
Since the 'y' element decl does not specify a XSD default value, the
concept of 'empty' and defaulting doesn't apply here, so a zero-length
string is a normal representation, and according to this section, it is
known-to-exist.
This contradicts 9.4.2.2 below.
DFDL Spec 9.3.1.3 says "Note: based on the above, when processing a
sequence for which a separator is defined, the presence of a match in the
data for the separator is not sufficient to cause the parser to determine
that an associated component is known-to-exist." It then refers you to
14.2.1.
I don't think this changes anything. Again it just admits that the
separator ";" can appear even without the following element. I.e., I think
it just allows for lax processing of excess separators.
DFDL Spec 9.4.2 Element Defaults When Parsing - Subsection
9.4.2.2 Simple element (xs:string or xs:hexBinary) (Emphasis below
is mine)
Here's the excerpted text:
"Required occurrence: If the element has a default value then an item is
added to the infoset using the default value, otherwise an item is added
to the Infoset using empty string (type xs:string) or empty hexBinary
(type xs:hexBinary) as the value. Optional occurrence: If
dfdl:emptyValueDelimiterPolicy is not 'none' then an item is added to
the Infoset using empty string (type xs:string) or empty hexBinary (type
xs:hexBinary) as the value, otherwise nothing is added to the Infoset.
Note: To prevent unwanted empty strings or empty hexBinary values from
being added to the Infoset, use XSD minLength > '0' and a dfdl:assert that
uses the dfdl:checkConstraints() function, to raise a processing error."
Note that the language states "if the element has a default value" - which
denotes that the section is dealing with both defaultable AND
non-defaultable elements, and is not exclusively discussing defaultable
elements as the title of 9.4.2 would imply.
The second statement is about optional occurrences, and it does not
qualify its rule by whether the element is defaultable. Hence, I read
"nothing is added to the infoset" as applying whether or not the element is
defaultable. So a zero-length (ZL) string is never going to create an
empty-string value for an optional element.
However, this contradicts the note about preventing unwanted empty
strings. That note is only sensible if optional elements of zero-length
will get added to the infoset and extra steps are required to force a
facet check to prevent them.
Unless I'm missing another place in the DFDL spec that clarifies this, I
think we need to revise this area to make things clearer.
But first we have to pick which is the intended semantics. In the example
above, which infoset is the one we want:
<ex_infix><x>foo</x><y/></ex_infix> (empty string as normal
representation takes priority over optionality)
or
<ex_infix><x>foo</x></ex_infix> (optionality takes priority over empty
string as normal representation)
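To make the two candidate outcomes concrete, here is a toy sketch that produces either infoset from the 4-character data. It is not Daffodil or IBM DFDL code. The function name and the keep_empty_optional flag are hypothetical, and the "parse" is just a split on the ";" separator, with any data after a second separator ignored.

```python
from xml.sax.saxutils import escape


def parse_ex_infix(data, keep_empty_optional):
    """Toy parse of the 'ex_infix' sequence: required x, optional y, infix ';'."""
    parts = data.split(";")
    items = ["<x>%s</x>" % escape(parts[0])]
    if len(parts) > 1:
        y = parts[1]
        if y:
            # Non-empty y is always added.
            items.append("<y>%s</y>" % escape(y))
        elif keep_empty_optional:
            # First behavior: zero-length string is a normal representation.
            items.append("<y/>")
        # Second behavior: optionality wins, so nothing is added for y.
    return "<ex_infix>%s</ex_infix>" % "".join(items)


print(parse_ex_infix("foo;", keep_empty_optional=True))
print(parse_ex_infix("foo;", keep_empty_optional=False))
```

The first call corresponds to the infoset with the empty 'y' element, the second to the infoset without it.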
Either way I think this change is needed:
Section 9.4.2 - change section title to "Element Defaults and Optionality
When Parsing"
But a bunch of other clarifications are also needed.
Today Daffodil 2.1.0 implements the first behavior.
<ex_infix><x>foo</x><y/></ex_infix> with the empty 'y' element.
What does IBM DFDL do?