[DFDL-WG] Arrays with empty elements
Tim Kimber
KIMBERT at uk.ibm.com
Mon Feb 14 07:00:25 CST 2011
Consider the following schema:
<xs:element name="array" minOccurs="1" maxOccurs="1">
  <xs:complexType>
    <xs:sequence dfdl:sequenceKind="ordered" dfdl:separatorPosition="infix"
                 dfdl:separatorPolicy="required" dfdl:separator=",">
      <xs:element name="array_item" type="xs:string" minOccurs="2"
                  maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
Allowed data streams and the resulting info sets (rendered as XML) are:
item_value,item_value
<array>
  <array_item>item_value</array_item>
  <array_item>item_value</array_item>
</array>
item_value,
<array>
  <array_item>item_value</array_item>
</array>
,item_value
<array>
  <array_item>item_value</array_item>
</array>
,
<array>
</array>
Notice rows 2 and 3 (the data streams 'item_value,' and ',item_value').
The parser has applied the rules in the DFDL specification and has treated
the zero-length elements as 'missing'. Furthermore, these missing elements
are not required, so they are omitted from the info set. This is a problem:
the receiver of the info set has no way to reliably determine whether the
array_item was the first or second item in the array. If presented to the
DFDL serializer, both info sets will produce the data stream for row 2.
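The ambiguity can be reproduced with a small model of the behaviour
described above. This is only a sketch, not a DFDL implementation: the
function names are hypothetical, the info set is modelled as a plain list
of strings, and the serializer is assumed to pad a short array with empty
trailing occurrences up to two items.

```python
def parse_array(data, separator=","):
    """Model of the current rule: zero-length occurrences are treated as
    'missing' and left out of the info set."""
    return [item for item in data.split(separator) if item != ""]


def serialize_array(items, separator=",", min_items=2):
    """Assumed serializer behaviour: write the occurrences in order and pad
    with empty trailing occurrences so the infix separator still appears."""
    padded = list(items) + [""] * max(0, min_items - len(items))
    return separator.join(padded)


row2 = parse_array("item_value,")
row3 = parse_array(",item_value")

# Both rows give the same info set, so the position of the item is lost.
print(row2 == row3)

# Round-tripping row 3 yields the data stream of row 2.
print(serialize_array(row3))
```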
Note that this is a problem only for arrays. A sequence of
differently-named optional elements will not be ambiguous because the
element names in the info set can be used to determine which elements were
present in the data.
Possible fixes:
a) Change the definition of 'required' from 'all occurrences up to
minOccurs' to 'all occurrences before the final non-missing occurrence'.
In scenarios like the one above, non-required occurrences would be put
into the infoset with a default value (assuming that a default was
defined in the model).
b) Provide a DFDL property that controls whether elements with zero-length
content are treated as missing.
The presence of one or more delimiters (a separator, initiator or
terminator) implies that an element is present in the data. Currently,
DFDL unconditionally treats an element as 'missing' if its content region
is zero-length, regardless of whether there were any delimiters for that
element.
In this scenario, if the parser acted on the presence of the delimiters,
the info sets would be distinguishable. The suggested name for the property
is 'dfdl:emptyValueMissingPolicy', with values 'missing' and 'included'.
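A rough sketch of how fix b) might behave, again modelling the info set as
a list of strings. The function name and the 'policy' argument are
hypothetical stand-ins for the proposed property, and representing an
included empty occurrence as an empty string is an assumption.

```python
def parse_with_policy(data, policy="missing", separator=","):
    """Model of the proposed property: 'missing' drops zero-length
    occurrences (current behaviour), 'included' keeps them because a
    separator showed that the occurrence was present in the data."""
    if policy not in ("missing", "included"):
        raise ValueError("policy must be 'missing' or 'included'")
    occurrences = data.split(separator)
    if policy == "included":
        return occurrences
    return [item for item in occurrences if item != ""]


# With 'included', rows 2 and 3 no longer collapse to the same info set.
print(parse_with_policy("item_value,", policy="included"))
print(parse_with_policy(",item_value", policy="included"))
```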
Fix a) would require the parser to keep track of the last-reported
occurrence of an array element. When it encountered a non-missing
occurrence, it would have to put any previously skipped non-required
occurrences into the infoset first.
An example might help: one,,,four
Occurrences 2 and 3 would initially be omitted from the infoset because
they are zero-length. Upon encountering occurrence 4, the parser would have
to put occurrences 2 and 3 into the infoset with the xs:default value
before putting 4 into the infoset.
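The bookkeeping for fix a) could look something like the following. It is
a sketch under simplifying assumptions: the info set is a list of strings,
the function name is hypothetical, and 'default' stands in for the
xs:default value defined in the model.

```python
def parse_with_fix_a(data, default, separator=","):
    """Emit skipped zero-length occurrences with the default value as soon
    as a later non-missing occurrence shows that they were not trailing."""
    infoset = []
    pending = 0  # zero-length occurrences seen since the last emitted item
    for raw in data.split(separator):
        if raw == "":
            pending += 1  # hold back; may turn out to be trailing
            continue
        infoset.extend([default] * pending)  # fill the gap first
        pending = 0
        infoset.append(raw)
    # Anything still pending is trailing and stays out of the info set.
    return infoset


print(parse_with_fix_a("one,,,four", default="dflt"))
print(parse_with_fix_a(",item_value", default="dflt"))
print(parse_with_fix_a("item_value,", default="dflt"))
```

With this rule, rows 2 and 3 of the original example produce different
info sets, because the leading empty occurrence in row 3 is filled in while
the trailing one in row 2 is still omitted.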
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert at uk.ibm.com
Tel. 01962-816742
Internal tel. 246742
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU