[DFDL-WG] Definition of 'missing element' - some edge cases
Tim Kimber
KIMBERT at uk.ibm.com
Tue Sep 7 05:57:39 CDT 2010
I'm going to send this and then duck - we've discussed the subject of
missing-ness and defaulting at considerable length already. However, I
genuinely do have some new information for your consideration so please
hear me out.
I'm seeking the opinion of the working group on the following questions:
a) can an element reliably be categorised as 'missing' when
separatorPolicy='suppressed'?
b) is it possible for an element to be 'missing' if it has
lengthKind='explicit' and its length is a static, non-zero value?
c) is it possible for an element to be 'missing' if it has a discriminator
that has already evaluated to 'true'?
For reference, the specification ( v0.42 ) says this concerning missing
elements:
Definition 'missing element'
On parsing, an element is missing if its content region in the data stream
is empty. The initiator and terminator regions of a missing element may,
or may not, also be empty as controlled by the
dfdl:emptyValueDelimiterPolicy property (simple and complex element), or
dfdl:nilValueDelimiterPolicy property (simple element).
Question a),
Compare the following data streams. In both cases, assume that
- separator is comma and separatorPosition is 'infix'
- missingValueDelimiterPolicy is set to 'none' so a 'missing' value should
not have an initiator.
- the initiators are A:, B: and C:
- values are a,b,c.
separatorPolicy='required' : A:a,,C:c
separatorPolicy='suppressed' : A:a,C:c
In the 'required' case, the parser detects that the initiator is missing,
then looks to see whether the content region is zero-length. It is, so the
element is 'missing'.
In the 'suppressed' case, the parser detects that the initiator is
missing, then looks to see whether the content region is zero-length. It
looks for a delimiter at the current position and finds 'C'. 'C' is not a
delimiter, so the content region is not zero-length. So the parser throws
a processing error - "initiator for element B was not found in the data".
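
The two cases above can be sketched as a toy model. This is not a DFDL
parser and not how any implementation is required to work. It only encodes
the rule as described: if the initiator is absent, the element is 'missing'
only when a delimiter (or end of data) sits at the current position. The
function name and return values are made up for illustration.

```python
SEPARATOR = ","


def classify_current_rule(data, pos, initiator):
    """Toy model of the current rule for one element starting at offset pos.

    Returns 'present', 'missing' or 'error' (hypothetical labels).
    """
    if data.startswith(initiator, pos):
        return "present"
    # Initiator not found. The content region counts as zero-length only if
    # a separator, or the end of the data, is at the current position.
    if pos == len(data) or data.startswith(SEPARATOR, pos):
        return "missing"
    # Anything else is treated as non-empty content without an initiator.
    return "error"  # "initiator for element B was not found in the data"


# Offset 4 is where element B would begin in both streams.
print(classify_current_rule("A:a,,C:c", 4, "B:"))  # separatorPolicy='required'
print(classify_current_rule("A:a,C:c", 4, "B:"))   # separatorPolicy='suppressed'
```

The first call reports 'missing' and the second reports 'error', which is
the asymmetry described above.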
I don't think the 'suppressed' behaviour is what a user will expect, nor
what the WG intended when these rules were drawn up. The problem is that
the parser cannot reliably determine the length of the content region when
separatorPolicy='suppressed'. It can, however, reliably detect whether
the element is present - the initiator gives a strong hint about that.
Somebody may say "well, duh! Of course the content region is empty if the
initiator is not present". That may be a reasonable rule, but it is not
the rule currently given in the specification. Note that the content
region has not been looked at, so that rule relies on the parser
speculatively parsing the element and then backtracking because the
initiator is not found. If we allow that, then why not allow default
values to be applied after other types of processing error ( even for
cases where no initiator was defined )? There are good reasons for not
applying defaults after normal backtracking ( hence the current rule ) so
any such 'missing initiator implies empty content' rule would have to made
explicit in the specification.
Possible refinements of the rules:
a) IF the length of the content region cannot reliably be determined (
lengthKind='delimited' and separatorPolicy='suppressed' ) AND
emptyValueDelimiterPolicy does not include the initiator AND the element
has an initiator AND the initiator was not found THEN assume that the
content length is zero and treat the element as missing.
or
b) IF ( the element has an initiator AND the initiator was not found ) THEN
IF the parent group has initiatedContent='yes' THEN the element is missing
else apply the existing rules.
b) would provide a way to get defaults applied in situations where the
content region's length is either fixed or undefined. Quite a lot of users
might assume this behaviour anyway.
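
To make the difference between the two refinements concrete, here is a
minimal sketch of them as predicates. The ElementState record and all
function names are hypothetical. I have also assumed that whenever a
refinement's preconditions do not hold, the existing rule (content region is
zero-length) applies, which is one reading of the 'else' in b).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ElementState:
    # Hypothetical record of what the parser knows about one element.
    has_initiator: bool
    initiator_found: bool
    length_kind: str                       # 'delimited', 'explicit', ...
    separator_policy: str                  # 'required' or 'suppressed'
    empty_policy_includes_initiator: bool  # from emptyValueDelimiterPolicy
    parent_initiated_content: bool         # initiatedContent='yes' on parent
    content_length: Optional[int] = None   # None when it cannot be determined


def missing_existing(e):
    # Existing rule: missing only if the content region is known to be empty.
    return e.content_length == 0


def missing_refinement_a(e):
    # Refinement a): content length is unreliable and the initiator is absent.
    if (e.length_kind == "delimited"
            and e.separator_policy == "suppressed"
            and not e.empty_policy_includes_initiator
            and e.has_initiator
            and not e.initiator_found):
        return True
    return missing_existing(e)


def missing_refinement_b(e):
    # Refinement b): an absent initiator under initiatedContent='yes'.
    if e.has_initiator and not e.initiator_found and e.parent_initiated_content:
        return True
    return missing_existing(e)


# Element B from the 'suppressed' stream A:a,C:c
b = ElementState(True, False, "delimited", "suppressed", False, True)
print(missing_existing(b), missing_refinement_a(b), missing_refinement_b(b))
```

Under these assumptions element B is not 'missing' by the existing rule but
is 'missing' under both a) and b). Only b) is independent of lengthKind and
separatorPolicy.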
Question b)
A similar situation can arise when lengthKind='explicit' and the length is
fixed ( i.e. is not a DFDL expression ). Unless the missing field occurs
at the end of a known-length structure the length of the content region
will never be zero. I think a similar rule is required for this case also:
- IF the length of the content region is fixed ( lengthKind='explicit' and
length is a static, non-zero value ) AND emptyValueDelimiterPolicy does
not include the initiator AND the element has an initiator AND the
initiator was not found THEN assume that the content length is zero and
treat the element as missing.
...or apply suggestion b) above.
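
A minimal sketch of the fixed-length rule as a predicate, with hypothetical
names and an example stream that I have made up (initiators A:, B:, C: and a
static length of 2 for every value):

```python
def missing_fixed_length(length_kind, length_is_static, length,
                         has_initiator, initiator_found,
                         empty_policy_includes_initiator):
    # Proposed rule for question b): with a static, non-zero length the
    # content region can never be seen as empty, so rely on the initiator.
    return (length_kind == "explicit"
            and length_is_static
            and length > 0
            and not empty_policy_includes_initiator
            and has_initiator
            and not initiator_found)


# Element B is absent from this stream; offset 4 is where it would begin.
data = "A:aaC:cc"
found = data.startswith("B:", 4)
print(missing_fixed_length("explicit", True, 2, True, found, False))
```

With the assumed stream the initiator B: is not at offset 4, so the
predicate treats B as missing even though two characters of data follow.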
Question c)
Suppose that an element has a discriminator, and it has already evaluated
to 'true' ( it must have been a backward reference to some
previously-parsed field ). The discriminator has unambiguously stated that
the element *is* present in the data. If it is subsequently found to have
a zero-length content region, should the parser treat it as 'missing' and
attempt to apply a default? I don't think so.
Please tell me that I'm missing something obvious here - it's starting to
sound complicated again.
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert at uk.ibm.com
Tel. 01962-816742
Internal tel. 246742
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU