[DFDL-WG] Fw: Editorial improvements for section 14.2

Steve Hanson smh at uk.ibm.com
Mon Jan 7 12:38:00 EST 2013


For discussion on next DFDL WG call.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 07/01/2013 17:32 -----

From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     Steve Hanson/UK/IBM at IBMGB, 
Cc:     Tim Kimber/UK/IBM at IBMGB
Date:   11/12/2012 17:15
Subject:        Re: Editorial improvements for section 14.2




Some added discussion on top of Steve's on the 14.2 separator property.


From:        Tim Kimber/UK/IBM 
To:        mbeckerle.dfdl at gmail.com, Steve Hanson/UK/IBM at IBMGB, 
Date:        10/12/2012 15:14 
Subject:        Editorial improvements for section 14.2 



A couple of things that I noticed while looking through the specification 
today: 

14.2        Title 
Section title should really be 'Sequence groups with separators'. 

SMH: Agree 

14.2         Description of 'separator' property 
"Specifies a whitespace separated list of alternative literal strings that 
are the possible separators between a sequence of elements or multiple 
occurrences of an element."
A separator applies to all members of a group, but this only talks about 
elements.
Suggestion: "Specifies a list of alternative separator values for the 
group. Each separator value is a DFDL string literal. If there is more 
than one separator in the list then the values are separated by white 
space."
I purposely omitted the point about multiple occurrences; I think it needs 
a separate description, unless we think that the tables make it clear 
enough. 

SMH: The wording here is very like that for initiator and terminator. The 
property type already has said that the strings are DFDL string literals. 
So I would say: 
"Specifies a whitespace separated list of alternative literal strings that 
are the possible separators for the sequence. Separators occur in the data 
either before, between or after all occurrences of the elements or groups 
that are the children of the sequence." 

14.2         Description of 'separator' property 
"This property can be computed by way of an expression which returns a 
string of whitespace separated values. 
It is a Schema Definition Error if the expression returns an empty string.
The expression must not contain forward references to elements which have 
not yet been processed." 
The later sentence about expressions that return an empty string could 
then be removed - I think it belongs in this paragraph. 
Also, there is a change in the text style midway through the paragraph. 
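
For illustration only, the rule in the proposed paragraph could be sketched as follows. The names SchemaDefinitionError and separator_alternatives are hypothetical, and treating a whitespace-only result the same as an empty string is an assumption, not something the spec text above states.

```python
class SchemaDefinitionError(Exception):
    """Hypothetical stand-in for a DFDL Schema Definition Error."""


def separator_alternatives(expression_result):
    """Split the evaluated 'separator' value into its alternatives."""
    # The property value is a whitespace separated list of string literals.
    alternatives = expression_result.split()
    if not alternatives:
        # Assumption: a whitespace-only result is treated like an empty one.
        raise SchemaDefinitionError(
            "separator expression returned an empty string")
    return alternatives
```

With this sketch, separator_alternatives(", ;") yields two alternatives, while an empty result raises the error.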

14.2         Description of 'separator' property 
"When parsing, the list of values is processed in a greedy manner, meaning 
it takes all the separators, that is, each of the string literals in the 
white space separated list, and matches them each against the data. In 
each case the longest possible match is found. The separator with the 
longest match as the one that is selected as having been ‘found’, with 
length-ties being resolved so that the matching separator is selected that 
is first in the order written in the schema. Once a matching separator is 
found, no other shorter matches will be subsequently attempted (ie, there 
is no backtracking to try parsing based on shorter separator matches)." 
I don't know what the correct wording is, but this is not it :-) 
This is a very complex piece of logic to describe, but it is fairly 
central to the parsing algorithms. If we don't get it right then we will 
end up with divergent DFDL implementations. I honestly don't know where or 
how we should be describing the delimiter parsing logic - can we discuss 
on the next WG call? 

SMH: This paragraph is solely describing how the matching works, not 
anything else. It is independent of lengthKind. This wording was agreed 
under errata 2.70 and is used for initiator and terminator as well. What 
specifically is the issue? 

MB: It's really unfortunate that there's this ambiguity about length-ties. 
But those can come up due to the character class entities. I.e., I can 
write separator="%SP;|%SP; %WSP+;|%WSP+;" and both of those would match as 
the separator of a and b in "a | b". 

MB: However, I'm not sure the above purple wording is really needed about 
length-ties. If a separator longest-matches, we're done. We don't really 
care if there are two separator patterns that are ambiguous and can match 
the same thing. If they both match the same 'longest' match, then the 
separator was found. 
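
To make the length-tie case concrete, here is a minimal sketch of the matching rule as the quoted paragraph describes it. It is not taken from any DFDL implementation. The two regular expressions are hypothetical stand-ins for the separators in the example above, Python's \s only approximates the DFDL %WSP; class, and match_separator is an invented name.

```python
import re

# Alternatives in the order written in the schema. Each regex is an assumed
# translation of the DFDL string literal next to it.
SEPARATORS = [
    ("%SP;|%SP;", re.compile(r" \| ")),
    ("%WSP+;|%WSP+;", re.compile(r"\s+\|\s+")),
]


def match_separator(data, pos, separators=SEPARATORS):
    """Return (literal, length) of the longest separator match at pos, or None."""
    best = None
    for literal, pattern in separators:
        m = pattern.match(data, pos)
        if m is None:
            continue
        length = m.end() - pos
        # Strictly greater: on a length tie the earlier schema entry is kept.
        if best is None or length > best[1]:
            best = (literal, length)
    # The caller commits to this result; shorter matches are never retried.
    return best
```

For "a | b" both alternatives match three characters at position 1, so the first one written is reported. For "a  |  b" only the second alternative matches. Either way the parser consumes the same span of data, which is the point made above: which alternative is named as the winner does not change what was found.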

14.2         Description of 'separator' property 
"If a child element uses an escape scheme, then the escape scheme also 
applies to any separator."
What does this mean? Can we remove it? 

SMH: It means that when unparsing a child element, any occurrence of the 
separator in the element's value will be escaped. 
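
A minimal sketch of that unparsing behaviour, assuming a single-character separator and an escape-character style scheme in which the escape character also escapes itself. The function name and defaults are hypothetical, and real DFDL escape schemes have more options than this.

```python
def unparse_sequence(values, separator=",", escape_char="\\"):
    """Join child values with the separator, escaping separators in values."""
    escaped = []
    for value in values:
        # Escape the escape character first so the output stays unambiguous.
        value = value.replace(escape_char, escape_char + escape_char)
        # Then escape any occurrence of the separator inside the value.
        value = value.replace(separator, escape_char + separator)
        escaped.append(value)
    return separator.join(escaped)
```

For example, the values "a,b" and "c" would be written as a\,b,c so that a parser does not mistake the comma inside the first value for a separator.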

14.1        Empty Sequences 
Doesn't seem right to have this as the very first sub-section. Can we make 
it the last, and move the other sections up by one? Or at least swap it 
with 14.2? 

SMH: I don't see it makes much difference where it goes. So on the grounds 
of spec renumbering I'd prefer if it stayed where it is. 

regards,

Tim Kimber, DFDL Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742  
Internal tel. 37246742


Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU



-- 
Mike Beckerle | OGF DFDL WG Co-Chair | Tresys Technologies
Tel:  781-330-0412

