[DFDL-WG] postfix separators, terminators, finalTerminatorCanBeMissing

Tim Kimber KIMBERT at uk.ibm.com
Thu Nov 19 18:44:01 CST 2009


I would like to explore the semantics of separators and terminators, and 
raise a question about consistency with regard to toleration of missing 
separators/terminators. Sorry for the barrage of questions lately - the 
implementation is uncovering some new angles. 

Relevant snippets from v0.36 of spec:

Section 14.2  Text Markup
The terminator region contains the terminator string. When a terminator is 
expected it is a processing error if one of the values is not found. 
However, if dfdl:finalTerminatorCanBeMissing is specified then it is not 
an error if the terminator is not found. 
...
When the finalTerminatorCanBeMissing property is true, then when an 
element is the last element in a sequence or array, then on input, it is 
not a parse error if the terminator is not found but end-of-parent or an 
enclosing delimiter is encountered instead.

Section 17.3 Sequence groups with delimiters
The separator region contains one of the strings specified by the 
dfdl:separator property. When this property has "" (empty string) as its 
value then the separator region is of length zero.
...
‘postfix’ means the separator occurs after each element. On parsing the 
separator after the last item is optional. On unparsing the final 
separator will always be written.

Section 17.3.1  Sequence groups and separators
re: ordered/suppressAtEnd : All separators must be found in the data 
except that when the sequence has trailing optional items, the separators 
are suppressed for any final missing items. 

My interpretation of the spec:
a) If an element's parent group defines a separator, that separator might 
not appear after the element. Instead, the group might be terminated early 
by the group's own terminator, or by the separator/terminator of an 
enclosing element/group or by end-of-data.
b) On the other hand, if an element defines a terminator, that terminator 
*must* appear after the element unless dfdl:finalTerminatorCanBeMissing 
(FTCBM below) is "true", in which case the element and its parent group 
can be terminated early by enclosing markup or end of data.
c) separatorPosition="postfix" is not enforced rigidly. The input document 
can always be constructed as if separatorPosition="infix" and the parser 
will not complain. This allows early termination of a separated group by 
enclosing markup, as well as by end-of-data.
d) The FTCBM flag allows the terminator of the final group member to be 
missing. This allows early termination of the group by enclosing markup or 
by end-of-data.
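
As a rough illustration of (b) and (d), here is a minimal Python sketch. It 
is a toy model and not DFDL: the function name parse_terminated and its 
parameters are hypothetical, and the input string is assumed to be the 
group's content already cut off at end-of-parent (enclosing markup or end 
of data).

```python
def parse_terminated(group, terminator, ftcbm):
    """Split `group` into fields that each end with `terminator`.

    `group` is assumed to hold everything up to end-of-parent.
    `ftcbm` models finalTerminatorCanBeMissing (hypothetical parameter name).
    """
    fields = []
    pos = 0
    while pos < len(group):
        end = group.find(terminator, pos)
        if end == -1:
            # No terminator before end-of-parent. Under reading (b) this is
            # an error, unless FTCBM is true and this is the final element.
            if not ftcbm:
                raise ValueError("terminator missing after final element")
            fields.append(group[pos:])
            return fields
        fields.append(group[pos:end])
        pos = end + len(terminator)
    return fields


print(parse_terminated("a*b*", "*", ftcbm=False))
print(parse_terminated("a*b", "*", ftcbm=True))
```

With ftcbm=False the second call would raise instead, which is the strict 
behaviour described in (b).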

I have reservations about these rules. 
- It seems overly lax to unconditionally allow 'postfix' to behave like 
'infix'. The equivalent flexibility for a terminator requires FTCBM to be 
set to "true".
- FTCBM is not as useful as it seems because it only applies to the final 
group member. If the final group member is optional, the user will be 
forced to use a postfix separator, and will then lose the control afforded 
by FTCBM.
- DFDL needs to allow strict validation of postfix separators/terminators. 
I can't see a way to achieve that with the current rules (see the example 
below).

Example: Lines are separated by <lf>. Lines have up to 3 fields. Fields 
can be empty. Fields are always terminated by a *. 
line:field1*field2*field3*<lf>
line:field1*field2*<lf>
line:field1**field3*<lf>

With the current rules, the following form of the second line would also 
be allowed (assuming that the * is defined as a postfix separator with 
separatorPolicy="suppressAtEnd"):
line:field1*field2<lf>
Note that the missing * after field2 is silently tolerated because postfix 
separators are allowed to be omitted.
To enforce the presence of the * after field2 it would have to be defined 
as a terminator on every field. But that would remove the flexibility 
afforded by the use of separators (see the third line).
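
The leniency in this example can be made concrete with a small Python 
sketch. It only models my reading of the current rules for one line; the 
function name parse_postfix_lenient is hypothetical, and the "line:" label 
and the <lf> are assumed to have been stripped already.

```python
def parse_postfix_lenient(line, sep="*", max_fields=3):
    """Parse one line whose fields use a postfix separator.

    Models the lenient reading: the separator after the last field present
    is optional, so postfix input may look like infix input.
    """
    parts = line.split(sep)
    # A trailing separator leaves one empty piece at the end; drop it.
    if parts and parts[-1] == "":
        parts.pop()
    if len(parts) > max_fields:
        raise ValueError("too many fields")
    return parts


for line in ["field1*field2*field3*", "field1*field2*",
             "field1**field3*", "field1*field2"]:
    print(repr(line), "->", parse_postfix_lenient(line))
```

The second and fourth inputs yield the same two fields, so the parser 
cannot tell that the * after field2 was missing.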

A possible solution:
- Strictly enforce separatorPosition="postfix".
- Make terminators mandatory.
- Remove the FTCBM flag, and replace it with a flag which tolerates 
end-of-data where any separator/terminator was expected. The definition of 
end-of-data would include the end of a defined-length parent element, but 
would specifically exclude end-of-parent caused by enclosing markup 
(because that would re-introduce the ambiguity which I'm trying to avoid).
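
A minimal Python sketch of how the proposed rules might behave on the 
example data. The names parse_document_strict and tolerate_end_of_data are 
hypothetical (the replacement flag has no name yet), only true end of input 
is modelled as end-of-data (not the end of a defined-length parent), and a 
missing <lf> after the last line is accepted as a simplification.

```python
def parse_document_strict(data, field_sep="*", line_sep="\n",
                          tolerate_end_of_data=False):
    """Parse lines of postfix-separated fields with strict enforcement.

    Every field must be followed by `field_sep`. A field cut short by the
    enclosing `line_sep` is always an error; a field cut short by end of
    data is accepted only when `tolerate_end_of_data` is true.
    """
    lines, fields = [], []
    pos, n = 0, len(data)
    while pos < n:
        if data.startswith(line_sep, pos):
            # Enclosing markup found where a new field could start.
            lines.append(fields)
            fields = []
            pos += len(line_sep)
            continue
        next_sep = data.find(field_sep, pos)
        next_line = data.find(line_sep, pos)
        if next_sep != -1 and (next_line == -1 or next_sep < next_line):
            fields.append(data[pos:next_sep])
            pos = next_sep + len(field_sep)
        elif next_line != -1:
            raise ValueError("field ended by enclosing markup, separator missing")
        else:
            if not tolerate_end_of_data:
                raise ValueError("field ended by end of data, separator missing")
            fields.append(data[pos:])
            pos = n
    if fields:
        lines.append(fields)
    return lines


print(parse_document_strict("field1*field2*\nfield1**field3*\n"))
```

Under this sketch "field1*field2\n" is rejected whatever the flag says, 
while "field1*field2" at the very end of the input is accepted only with 
tolerate_end_of_data=True.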

These rules are considerably tighter than the existing ones, but I don't 
think they make anything impossible. I do think they make the meaning of 
the various settings a lot simpler. Terminators would be less 'optional' 
than before, but I suspect that the real-world scenarios would be catered 
for.

Anyway - comments invited. (Invitation unnecessary, I suspect.)

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742 
Internal tel. 246742





