[DFDL-WG] Definition of 'missing element' - some edge cases

Tim Kimber KIMBERT at uk.ibm.com
Tue Sep 7 05:57:39 CDT 2010


I'm going to send this and then duck - we've discussed the subject of 
missing-ness and defaulting at considerable length already. However, I 
genuinely do have some new information for your consideration so please 
hear me out.

I'm seeking the opinion of the working group on the following questions:
a) can an element reliably be categorised as 'missing' when 
separatorPolicy='suppressed'?
b) is it possible for an element to be 'missing' if it has 
lengthKind='explicit' and its length is a static, non-zero value?
c) is it possible for an element to be 'missing' if it has a discriminator 
that has already evaluated to 'true'?

For reference, the specification ( v0.42 ) says this concerning missing 
elements:
Definition 'missing element'
On parsing, an element is missing if its content region in the data stream 
is empty. The initiator and terminator regions of a missing element may, 
or may not, also be empty as controlled by the 
dfdl:emptyValueDelimiterPolicy property (simple and complex element), or 
dfdl:nilValueDelimiterPolicy property (simple element).


Question a)
Compare the following data streams. In both cases, assume that 
- separator is comma and separatorPosition is 'infix'
- missingValueDelimiterPolicy is set to 'none' so a 'missing' value should 
not have an initiator.
- the initiators are A:, B: and C: 
- values are a,b,c. 

separatorPolicy='required' : A:a,,C:c
separatorPolicy='suppressed' : A:a,C:c

In the 'required' case, the parser detects that the initiator is missing, 
then looks to see whether the content region is zero-length. It is, so the 
element is 'missing'.
In the 'suppressed' case, the parser detects that the initiator is 
missing, then looks to see whether the content region is zero-length. It 
looks for a delimiter at the current position and finds 'C'. 'C' is not a 
delimiter, so the content region is not zero-length. So the parser throws 
a processing error - "initiator for element B was not found in the data".
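
As a rough illustration of the two cases above, here is a toy parser that 
applies the rule as I have described it: an element is 'missing' only if 
its initiator is absent and a delimiter (separator or end of data) sits at 
the current position. The function name, the ProcessingError class and the 
way the result is returned are all invented for this sketch; it does not 
model separatorPolicy itself, only the data stream that each setting would 
produce.

```python
class ProcessingError(Exception):
    """Hypothetical error type standing in for a DFDL processing error."""


def parse_sequence(data, initiators, separator=","):
    """Toy parser for an infix-separated sequence of initiated elements.

    Returns a dict of element name -> value, with None for a 'missing' element.
    """
    pos = 0
    result = {}
    for index, initiator in enumerate(initiators):
        name = initiator.rstrip(":")
        if index > 0:
            # Infix separator is expected before every element after the first.
            if not data.startswith(separator, pos):
                raise ProcessingError(
                    f"separator before element {name} was not found in the data")
            pos += len(separator)
        if data.startswith(initiator, pos):
            # Initiator found: content runs to the next separator or end of data.
            pos += len(initiator)
            end = data.find(separator, pos)
            if end == -1:
                end = len(data)
            result[name] = data[pos:end]
            pos = end
        elif pos == len(data) or data.startswith(separator, pos):
            # No initiator, and a delimiter is at the current position:
            # the content region is zero-length, so the element is 'missing'.
            result[name] = None
        else:
            # No initiator, and the content region is not zero-length.
            raise ProcessingError(
                f"initiator for element {name} was not found in the data")
    return result
```

With initiators ["A:", "B:", "C:"], the 'required' stream "A:a,,C:c" 
parses with B reported as missing, while the 'suppressed' stream "A:a,C:c" 
raises the processing error for element B.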

I don't think the 'suppressed' behaviour is what a user will expect, nor 
what the WG intended when these rules were drawn up. The problem is that 
the parser cannot reliably determine the length of the content region when 
separatorPolicy='suppressed'.  It can, however, reliably detect whether 
the element is present - the initiator gives a strong hint about that.
Somebody may say "well, duh! Of course the content region is empty if the 
initiator is not present". That may be a reasonable rule, but it is not 
the rule currently given in the specification. Note that the content 
region has not been looked at, so that rule relies on the parser 
speculatively parsing the element and then backtracking because the 
initiator is not found. If we allow that, then why not allow default 
values to be applied after other types of processing error ( even for 
cases where no initiator was defined )? There are good reasons for not 
applying defaults after normal backtracking ( hence the current rule ) so 
any such 'missing initiator implies empty content' rule would have to be 
made explicit in the specification.

Possible refinements of the rules:
a) IF the length of the content region cannot reliably be determined ( 
lengthKind='delimited' and separatorPolicy='suppressed' ) AND 
emptyValueDelimiterPolicy does not include the initiator AND the element 
has an initiator AND the initiator was not found THEN assume that the 
content length is zero and treat the element as missing.
or
b) IF ( the element has an initiator AND the initiator was not found ) THEN 
IF the parent group has initiatedContent='yes' THEN the element is missing 
else apply the existing rules.
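
To make the two candidate refinements easier to compare, here they are 
written as predicates. This is only a sketch: ElementState and its field 
names are invented for illustration, and I have assumed that 
emptyValueDelimiterPolicy 'includes the initiator' when its value is 
'initiator' or 'both'. For b) I have read the ELSE as covering every case 
in which the refinement does not fire, so the function returns None to 
mean "apply the existing rules".

```python
from dataclasses import dataclass


@dataclass
class ElementState:
    """Hypothetical snapshot of what the parser knows about one element."""
    length_kind: str                   # e.g. 'delimited' or 'explicit'
    separator_policy: str              # e.g. 'required' or 'suppressed'
    empty_value_delimiter_policy: str  # assumed values: 'none', 'initiator', 'terminator', 'both'
    has_initiator: bool
    initiator_found: bool
    parent_initiated_content: bool     # True when initiatedContent='yes'


def missing_by_refinement_a(e):
    """Refinement a): assume zero-length content when the length is unreliable."""
    length_unreliable = (e.length_kind == "delimited"
                         and e.separator_policy == "suppressed")
    policy_excludes_initiator = (
        e.empty_value_delimiter_policy not in ("initiator", "both"))
    return (length_unreliable and policy_excludes_initiator
            and e.has_initiator and not e.initiator_found)


def missing_by_refinement_b(e):
    """Refinement b): True if missing, None if the existing rules must decide."""
    if e.has_initiator and not e.initiator_found and e.parent_initiated_content:
        return True
    return None
```

Note that a) depends on lengthKind and separatorPolicy, whereas b) depends 
only on the initiator and on the parent group's initiatedContent setting.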

b) would provide a way to get defaults applied in situations where the 
content region's length is either fixed or undefined. Quite a lot of users 
might assume this behaviour anyway.

Question b)
A similar situation can arise when lengthKind='explicit' and the length is 
fixed ( i.e. is not a DFDL expression ). Unless the missing field occurs 
at the end of a known-length structure, the length of the content region 
will never be zero. I think a similar rule is required for this case also:

- IF the length of the content region is fixed ( lengthKind='explicit' and 
length is a static, non-zero value ) AND emptyValueDelimiterPolicy does 
not include the initiator AND the element has an initiator AND the 
initiator was not found THEN assume that the content length is zero and 
treat the element as missing.
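
A small sketch of why the existing rule cannot fire in the fixed-length 
case. The helper name and the sample data are invented, and the end of the 
data string stands in for the end of the enclosing known-length structure.

```python
def fixed_content_region_length(data, pos, fixed_length):
    """Length of the content region a fixed-length element would occupy at pos.

    The region is only truncated (possibly to zero) by the end of the data.
    """
    return min(fixed_length, len(data) - pos)


# Elements A, B and C each have a fixed length of 3; B is absent from the data.
data = "A:aaaC:ccc"
pos_after_a = 5           # position where element B would start
pos_at_end = len(data)    # position where a trailing element would start

length_mid_stream = fixed_content_region_length(data, pos_after_a, 3)
length_at_end = fixed_content_region_length(data, pos_at_end, 3)
```

Here length_mid_stream is 3 (the region covers "C:c"), so B is not 
'missing' under the current definition; only length_at_end is 0.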

...or apply suggestion b) above.

Question c)
Suppose that an element has a discriminator, and it has already evaluated 
to 'true' ( it must have been a backward reference to some 
previously-parsed field ). The discriminator has unambiguously stated that 
the element *is* present in the data. If it is subsequently found to have 
a zero-length content region, should the parser treat it as 'missing' and 
attempt to apply a default? I don't think so.

Please tell me that I'm missing something obvious here - it's starting to 
sound complicated again.

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742 
Internal tel. 246742





