[DFDL-WG] clarification needed - ambiguity about empty string and optional element

Mike Beckerle mbeckerle.dfdl at gmail.com
Wed Jul 11 08:16:05 EDT 2018


Consider this data of 4 characters:


foo;


Consider this schema where the default format is the basic general set of
text-oriented defaults.


<xs:element name="ex_infix" dfdl:lengthKind="implicit">
  <xs:complexType>
    <xs:sequence dfdl:separator=";"
        dfdl:separatorSuppressionPolicy="anyEmpty" dfdl:separatorPosition="infix">
      <xs:element name="x" type="xs:string" dfdl:lengthKind="delimited"/>
      <xs:element name="y" type="xs:string" minOccurs="0"
          dfdl:lengthKind="delimited"
          dfdl:occursCountKind="implicit"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>



This is in a current Daffodil unit test, and produces this infoset:


<ex_infix><x>foo</x><y/></ex_infix>


That is, an empty string element is created for element 'y'.


I'd like to know what IBM DFDL produces as the infoset for this example.


I believe the DFDL spec is self-contradictory here, and therefore ambiguous
about what the right behavior is.



   - DFDL Spec 14.2.1 description of anyEmpty: "...any occurrences that
   have zero length representation MAY be omitted from the data, along with
   their associated separator."
      - Note that it says "may", not "must be". So anyEmpty is "lax" in
      insisting that the zero-length elements aren't present.
      - This doesn't clarify anything for us. But it admits the possibility
      that the ";" separator appears even if the 'y' element occurrence is
      determined to not exist.



   - DFDL Spec 9.3.1.1 says an element is known to exist if it has the nil,
   empty, or normal representation
      - In the example, element 'y' is zero-length, which is either the empty
      or the normal representation, since a string can have "" (empty string)
      as a value.
      - Since the 'y' element declaration does not specify an XSD default
      value, the concept of 'empty' and defaulting doesn't apply here, so a
      zero-length string is a normal representation and, according to this
      section, it is known-to-exist.
      - This contradicts 9.4.2.2 below.



   - DFDL Spec 9.3.1.3 says "Note: based on the above, when processing a
   sequence for which a separator is defined, the presence of a match in the
   data for the separator is not sufficient to cause the parser to determine
   that an associated component is known-to-exist." It then refers you to
   14.2.1
      - I don't think this changes anything. Again it just admits that the
      separator ";" can appear even without the following element. I.e., I
      think it just allows for lax processing of excess separators.



   - DFDL Spec 9.4.2 Element Defaults When Parsing - Subsection 9.4.2.2
   Simple element (xs:string or xs:hexBinary). (Emphasis below is mine.)
      - Here's the excerpted text:
         "Required occurrence: *If the element has a default value* then
         an item is added to the infoset using the default value, otherwise
         an item is added to the Infoset using empty string (type xs:string)
         or empty hexBinary (type xs:hexBinary) as the value. Optional
         occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none'[12]
         <http://daffodil.apache.org/docs/dfdl/#_ftn12> then an item is
         added to the Infoset using empty string (type xs:string) or empty
         hexBinary (type xs:hexBinary) as the value, *otherwise nothing is
         added to the Infoset.*

         Note: *To prevent unwanted empty strings* or empty hexBinary values
         from being added to the Infoset, use XSD minLength > '0' and a
         dfdl:assert that uses the dfdl:checkConstraints() function, to raise
         a processing error."
      - Note that the language states "if the element has a default value",
      which indicates that the section deals with both defaultable AND
      non-defaultable elements, and is not exclusively discussing defaultable
      elements as the title of 9.4.2 would imply.
      - The second statement is about optional occurrences, and it does not
      qualify whether the element is defaultable. Hence, I read "nothing is
      added to the infoset" as applying whether or not the element is
      defaultable. So a zero-length (ZL) string is never going to create an
      empty-string value for an optional element.
      - However, this contradicts the note about preventing unwanted empty
      strings. That note is only sensible if zero-length optional elements
      get added to the infoset and extra steps are required to force a facet
      check to prevent them.
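
The workaround that the 9.4.2.2 note describes could look roughly like the
following replacement for the 'y' declaration in the example schema. This is
only a sketch: the minLength value of 1 and the placement of the dfdl:assert
on the element are assumptions drawn from the wording of the note, the xs and
dfdl prefixes are assumed to be bound as in the schema above, and it has not
been run against Daffodil or IBM DFDL.

```xml
<!-- Sketch only: optional 'y' with a minLength facet and a constraint check. -->
<xs:element name="y" minOccurs="0"
    dfdl:lengthKind="delimited"
    dfdl:occursCountKind="implicit">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <!-- Raises a processing error when the parsed value violates the
           facets below, i.e. when 'y' would be an empty string. -->
      <dfdl:assert test="{ dfdl:checkConstraints(.) }"/>
    </xs:appinfo>
  </xs:annotation>
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <!-- Assumed facet value: any minLength greater than 0 matches the note. -->
      <xs:minLength value="1"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>
```

If an implementation creates an empty-string 'y' for the data "foo;", the
assert would be expected to fail for that occurrence. Whether that leaves 'y'
simply absent or ends in an overall parse error depends on how the processor
treats the trailing separator, which is the open question here.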


Unless I'm missing another place in the DFDL spec that clarifies this, I
think we need to revise this area to make things clearer.


But first we have to pick which is the intended semantics. In the example
above, which infoset is the one we want:


    <ex_infix><x>foo</x><y/></ex_infix>
    (empty string as normal representation takes priority over optionality)

or

    <ex_infix><x>foo</x></ex_infix>
    (optionality takes priority over empty string as normal representation)


Either way I think this change is needed:

   - Section 9.4.2 - change section title to "Element Defaults and
   Optionality When Parsing"

But a bunch of other clarifications are also needed.

Today Daffodil 2.1.0 implements the first behavior:
<ex_infix><x>foo</x><y/></ex_infix> with the empty 'y' element.

What does IBM DFDL do?









Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>