[DFDL-WG] postfix separators, terminators, finalTerminatorCanBeMissing
Tim Kimber
KIMBERT at uk.ibm.com
Thu Nov 19 18:44:01 CST 2009
I would like to explore the semantics of separators and terminators, and
raise a question about consistency with regard to toleration of missing
separators/terminators. Sorry for the barrage of questions lately - the
implementation is uncovering some new angles.
Relevant snippets from v0.36 of spec:
Section 14.2 Text Markup
The terminator region contains the terminator string. When a terminator is
expected it is a processing error if one of the values is not found.
However, if dfdl:finalTerminatorCanBeMissing is specified then it is not
an error if the terminator is not found.
...
When the finalTerminatorCanBeMissing property is true, then when an
element is the last element in a sequence or array, then on input, it is
not a parse error if the terminator is not found but end-of-parent or an
enclosing delimiter is encountered instead.
Section 17.3 Sequence groups with delimiters
The separator region contains one of the strings specified by the
dfdl:separator property. When this property has "" (empty string) as its
value then the separator region is of length zero.
...
‘postfix’ means the separator occurs after each element. On parsing the
separator after the last item is optional. On unparsing the final
separator will always be written.
Section 17.3.1 Sequence groups and separators
re: ordered/suppressAtEnd : All separators must be found in the data
except that when the sequence has trailing optional items, the separators
are suppressed for any final missing items.
My interpretation of the spec:
a) If an element's parent group defines a separator, that separator might
not appear after the element. Instead, the group might be terminated early
by the group's own terminator, or by the separator/terminator of an
enclosing element/group or by end-of-data.
b) On the other hand, if an element defines a terminator, that terminator
*must* appear after the element unless finalTerminatorCanBeMissing
( FTCBM from here on ) is "true", in which case the element and its parent
group can be terminated early by enclosing markup or end of data.
c) separatorPosition="postfix" is not enforced rigidly. The input document
can always be constructed as if separatorPosition="infix" and the parser
will not complain. This allows early termination of a separated group by
enclosing markup, as well as by end-of-data.
d) The FTCBM flag allows the terminator of the final group member to be
missing. This allows early termination of the group by enclosing markup or
by end-of-data.
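
Rules a) to d) above can be condensed into a small decision function. This
is only a sketch of my reading of the rules, not anything taken from the
spec: the names Next, tolerated, kind, ftcbm and is_last are all invented
for illustration, and the model ignores everything except what follows the
element.

```python
from enum import Enum


class Next(Enum):
    """What the parser finds where the element's own markup was expected."""
    OWN_MARKUP = 1        # the expected separator/terminator itself
    ENCLOSING_MARKUP = 2  # a separator/terminator of an enclosing element/group
    END_OF_DATA = 3


def tolerated(kind: str, found: Next, ftcbm: bool = False, is_last: bool = False) -> bool:
    """Return True if the parse continues without error (hypothetical model)."""
    if kind not in ("separator", "terminator"):
        raise ValueError(f"unknown markup kind: {kind!r}")
    if found is Next.OWN_MARKUP:
        return True
    if kind == "separator":
        # Rules a) and c): a missing separator is always tolerated when the
        # group is ended early by enclosing markup or by end-of-data.
        return True
    # Rules b) and d): a missing terminator is tolerated only for the final
    # group member, and only when FTCBM is "true".
    return ftcbm and is_last
```

The asymmetry I am objecting to shows up in the two branches: the separator
branch has no flag at all, while the terminator branch needs both FTCBM and
the element to be the last one.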
I have reservations about these rules.
- It seems overly lax to unconditionally allow 'postfix' to behave like
'infix'. The equivalent flexibility for a terminator requires FTCBM to be
set to "true".
- FTCBM is not as useful as it seems because it only applies to the final
group member. If the final group member is optional, the user will be
forced to use a postfix separator, and will then lose the control afforded
by FTCBM.
- DFDL needs to allow strict validation of postfix separators/terminators.
I can't see a way to achieve that with the current rules ( see example
below)
Example: Lines are separated by <lf>. Lines have up to 3 fields. Fields
can be empty. Fields are always terminated by a *.
line:field1*field2*field3*<lf>
line:field1*field2*<lf>
line:field1**field3*<lf>
With the current rules, and assuming that the * is defined as a postfix
separator with separatorPolicy="suppressAtEnd", this form of the second
line would also be allowed:
line:field1*field2<lf>
Note that the missing * after field2 is silently tolerated because postfix
separators are allowed to be omitted.
To enforce the presence of the * after field2 it would have to be defined
as a terminator on every field. But that would remove the flexibility
afforded by the use of separators ( see third line )
A possible solution:
- Strictly enforce separatorPosition="postfix".
- Make terminators mandatory.
- Remove the FTCBM flag, and replace it with a flag which tolerates
end-of-data where any separator/terminator was expected. The definition of
end-of-data would include the end of a defined-length parent element, but
would specifically exclude end-of-parent caused by enclosing markup (
because that would re-introduce the ambiguity which I'm trying to avoid ).
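
A rough sketch of how the proposed rules would behave on the example data,
with * after every field and <lf> after every line. The function parse_lines
and the flag name tolerate_end_of_data are invented for illustration (I am
not proposing a property name here), and the defined-length parent case is
not modelled.

```python
def parse_lines(data: str, tolerate_end_of_data: bool = False) -> list[list[str]]:
    """Parse '*'-postfixed fields in newline-postfixed lines, strictly."""
    lines: list[list[str]] = []
    pos = 0
    while pos < len(data):
        fields: list[str] = []
        while True:
            if data.startswith("\n", pos):
                pos += 1  # the line separator closes the current line
                break
            star = data.find("*", pos)
            newline = data.find("\n", pos)
            if star == -1 or (newline != -1 and newline < star):
                # The field is not followed by its '*'.
                if newline == -1 and tolerate_end_of_data:
                    fields.append(data[pos:])  # excused by true end-of-data
                    pos = len(data)
                    break
                # Enclosing markup (the newline) never excuses a missing '*'.
                where = newline if newline != -1 else len(data)
                raise ValueError(f"missing '*' at offset {where}")
            fields.append(data[pos:star])
            pos = star + 1
            if pos == len(data):
                # End of data where the line separator was expected.
                if tolerate_end_of_data:
                    break
                raise ValueError("missing line separator at end of data")
        lines.append(fields)
    return lines
```

With the flag set, "field1*field2" and "field1*field2*" are both accepted
as a final line. "field1*field2<lf>" is rejected either way, because the
missing * is followed by enclosing markup rather than by end-of-data.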
These rules are considerably tighter than the existing ones, but I don't
think they make anything impossible. I do think they make the meaning of
the various settings a lot simpler. Terminators would be less 'optional'
than before, but I suspect that the real-world scenarios would be catered
for.
Anyway - comments invited. ( invitation unnecessary, I suspect )
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert at uk.ibm.com
Tel. 01962-816742
Internal tel. 246742