[DFDL-WG] A selection of example data formats

Wed Jun 15 08:52:23 CDT 2011

Hi Mike

More replies but this time I'll keep them together here as the Word doc 
would get hard to read....

Tim and I have been thinking along similar lines to your "have enough 
properties to determine that the length is zero". In addition to your 
examples there are also:
- lengthKind="prefixed" and prefix length is 0
- lengthKind="explicit" and lengthCount expression evaluates to 0
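
To make the two extra cases concrete, here is a minimal sketch of a "length is known to be zero" check. It is an illustration only: LengthInfo and length_known_zero are hypothetical names, not anything from the spec, and the sketch assumes the length expression or prefix has already been evaluated to an integer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LengthInfo:
    """Hypothetical summary of what is known before any content is read."""
    length_kind: str                       # e.g. "explicit", "prefixed"
    explicit_length: Optional[int] = None  # evaluated length expression
    prefix_value: Optional[int] = None     # value read from the length prefix


def length_known_zero(info: LengthInfo) -> bool:
    """True when the properties alone establish a zero length."""
    if info.length_kind == "explicit":
        return info.explicit_length == 0
    if info.length_kind == "prefixed":
        return info.prefix_value == 0
    # Other length kinds need framing or scanning; not covered here.
    return False
```

For example, length_known_zero(LengthInfo("prefixed", prefix_value=0)) is True, while an explicit length of 4 gives False.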

Using the same sectioning as the document...

-------------------------------------------------
a) Fixed length, no delimiters
We agree that there should be no defaulting when the length is > 0.
We need to decide whether the length = 0 case implies defaulting. We think 
it does, as the property determines that the length is zero.

b) Fixed length, only parent has delimiters
This boils down to whether we need to detect early termination. Both the 
spec and you are clear that scanning is off when parsing fixed length. I'd 
like to hear what Steph has to say on this.

c) Fixed length, initiators
You want to treat this the same as un-initiated fixed length. OK, but more 
on this later under i)

e) Delimited, separators required
We agree that defaults should be applied when adjacent separators are 
encountered

f) Delimited, separators suppressed at end
We agree that defaults should be applied when adjacent separators are 
encountered and at the end
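
As an illustration of e) and f), here is a minimal sketch that applies defaults to empty fields in a separated group. parse_separated, its parameters and the sample data are all hypothetical, and the sketch assumes a single-character infix separator with no escaping.

```python
def parse_separated(data, sep, names, defaults, trailing_suppressed=False):
    """Map separated fields to element names, defaulting empty ones."""
    fields = data.split(sep)
    if len(fields) > len(names):
        raise ValueError("more fields than elements")
    if len(fields) < len(names):
        if not trailing_suppressed:
            # Case e): every separator is required, so this is an error.
            raise ValueError("missing separators")
        # Case f): trailing separators were suppressed; treat the
        # remaining elements as empty so that they are defaulted.
        fields += [""] * (len(names) - len(fields))
    return {n: (f if f != "" else defaults[n]) for n, f in zip(names, fields)}
```

With names A, B, C, the data "a,,c" defaults B in both cases, and the data "a" defaults B and C only when trailing separators are suppressed.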

g) Delimited, initiators, separators required
We agree that defaults should be applied when adjacent separators are 
encountered

i) Delimited, initiators, separators suppressed
You want defaults to be applied when an element is entirely absent (B in 
the example)
Tim and I struggle to differentiate this case from c).  At the start of B 
processing, there is nothing in the data to indicate B and the next thing 
is C's initiator. So why is the defaulting rule different?
Take this one step further - my data is fixed length, initiated and the 
parent has a suppressed separator - so which of c) and i) applies?

How does the parser know when a group has ended? 
One of Tim's rules was when an enclosing delimiter is found.  That is not 
always the case. Tim suggested that if the immediate parent had 
lengthKind="implicit" then we would not be looking for delimiters. I 
believe your YES was agreeing with that?  We would say it is also true if 
the immediate parent had lengthKind = "explicit" or "pattern" too.

What is the algorithm for selecting the next occurrence?
Tim and I discussed this, and there is not an issue here. The 
occursCountKind always tells you the number to expect (which might be 
'don't know' if occursCountKind = "parsed" in which case we just 
speculatively parse).
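
A minimal sketch of that idea follows. The function names are hypothetical, only three occursCountKind values are handled, and a failed speculative parse is modelled as an exception.

```python
from typing import Callable, List, Optional


def expected_occurrences(kind: str, max_occurs: Optional[int] = None,
                         expr_value: Optional[int] = None) -> Optional[int]:
    """Number of occurrences to expect, or None for "don't know"."""
    if kind == "fixed":
        return max_occurs
    if kind == "expression":
        return expr_value
    if kind == "parsed":
        return None
    raise ValueError(f"occursCountKind not handled in this sketch: {kind}")


def parse_occurrences(tokens: List[str], expected: Optional[int],
                      parse_one: Callable[[str], object]) -> list:
    """Parse occurrences; when expected is None, parse speculatively."""
    results = []
    i = 0
    while expected is None or len(results) < expected:
        try:
            results.append(parse_one(tokens[i]))
            i += 1
        except (ValueError, IndexError):
            if expected is None:
                break   # speculation failed: the array ends here
            raise       # a known count was not satisfied
    return results
```

With parse_one=int and tokens ["1", "2", "x"], an unknown count yields [1, 2], whereas an expected count of 3 raises an error.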

When parsing a group with separatorPolicy=suppressed, is every group 
member a 'point of uncertainty'?
Agree with your statement.
----------------------------------

Other things to discuss:

Defaulting complex elements when parsing
The spec says that if zero length content is obtained for a complex 
element then it is defaulted, which means the element's complex type is 
walked and default values are sent to the infoset for required elements. 
It is an error if any required elements do not have a default value. A 
simpler alternative is to create just the element in the infoset with no 
children, but this would fail validation if switched on.
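
The walk described above could look roughly like the sketch below. ElementDecl and default_element are hypothetical names, the infoset is modelled as nested dicts, and optional children are simply skipped.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ElementDecl:
    """Hypothetical, much simplified element declaration."""
    name: str
    required: bool = True
    default: Optional[str] = None
    children: List["ElementDecl"] = field(default_factory=list)


def default_element(decl: ElementDecl):
    """Build the defaulted infoset value for an element."""
    if decl.children:
        # Complex element: walk the type, defaulting required children.
        return {c.name: default_element(c)
                for c in decl.children if c.required}
    if decl.default is None:
        raise ValueError(f"required element {decl.name!r} has no default")
    return decl.default
```

A complex element whose required children all have defaults yields a fully populated dict; a single required child without a default makes the whole walk fail.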

Separator position
Any rules that we agree on must take into account infix vs. prefix vs. 
postfix. In practice this determines how an element is 'bound' to a 
separator. With prefix it is bound to the beginning, with postfix it is 
bound to the end, and with infix it is bound to the beginning except for 
the first element (need to check with Steph if that is how WTX does it).
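
To illustrate the three positions, here is a minimal sketch that recovers element contents from a separated group. split_by_position is a hypothetical helper, and it assumes a literal separator with no escaping; it does not model how WTX binds separators.

```python
def split_by_position(data: str, sep: str, position: str) -> list:
    """Return element contents for a given separator position."""
    if position == "prefix":
        # Every element, including the first, is preceded by a separator.
        if not data.startswith(sep):
            raise ValueError("expected leading separator")
        return data[len(sep):].split(sep)
    if position == "postfix":
        # Every element, including the last, is followed by a separator.
        if not data.endswith(sep):
            raise ValueError("expected trailing separator")
        return data[:-len(sep)].split(sep)
    if position == "infix":
        # Separators appear only between elements.
        return data.split(sep)
    raise ValueError(f"unknown separator position: {position}")
```

The inputs ",a,b" (prefix), "a,b," (postfix) and "a,b" (infix) all yield the same two elements.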

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848

From:
"Mike Beckerle" <mbeckerle.dfdl at gmail.com>
To:
Tim Kimber/UK/IBM at IBMGB
Cc:
Steve Hanson/UK/IBM at IBMGB
Date:
11/06/2011 01:56
Subject:
RE: A selection of example data formats

My comments on your examples. I had to turn it into a word doc to 
reasonably put my commentary inline into this.

I think the concept of an element declaration being classified into:

- Can be defaulted from nothing
- Can be defaulted from empty content (but requires some framing to 
determine that the content is empty)
- Cannot be defaulted (requires at least some content bits, possibly also 
some framing)

… I think this is something we’re in need of in the spec. 

If the element can be defaulted from nothing, and it is required, and we 
have nothing, i.e., no bits meaning that we have enough properties to 
determine that the length is zero, then we default it to get the infoset 
value. If it’s optional, then we don’t default it, and nothing goes into 
the infoset. 
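
A minimal sketch of this rule, for the "we have nothing" situation only. Defaultable and value_when_nothing are hypothetical names; returning None stands for "nothing goes into the infoset", and treating the remaining required cases as an error is an assumption, since the text above does not cover them.

```python
from enum import Enum
from typing import Optional


class Defaultable(Enum):
    FROM_NOTHING = "can be defaulted from nothing"
    FROM_EMPTY_CONTENT = "can be defaulted from empty content"
    NEVER = "cannot be defaulted"


def value_when_nothing(kind: Defaultable, required: bool,
                       default: Optional[str]) -> Optional[str]:
    """Infoset value for an element when no bits are available for it."""
    if not required:
        return None      # optional: nothing goes into the infoset
    if kind is Defaultable.FROM_NOTHING:
        return default   # required: default it to get the infoset value
    # Assumption: a required element needing framing or content
    # cannot be satisfied from nothing.
    raise ValueError("required element cannot be defaulted from nothing")
```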

This raises the question of what “have enough properties to determine that 
the length is zero” means. 

Examples of this:
- End of data.
- End of parent.
- This element has no delimiters, but lengthKind="delimited" and a parent 
delimiter was immediately encountered, which terminates the element after 
zero bits.
- lengthKind="pattern", lengthPattern="a*", and the data has no "a" 
characters, so the length comes out zero, and no bits are consumed.
- Recursively, the length of a group is zero when the same properties hold 
inductively for its members and for the group itself.
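
The pattern case can be checked with any regular expression engine. The sketch below uses Python's re module purely as a stand-in (DFDL's own regular expression dialect may differ), and matched_length is a hypothetical helper.

```python
import re


def matched_length(pattern: str, data: str) -> int:
    """Length of the match of pattern at the start of data."""
    m = re.compile(pattern).match(data)
    if m is None:
        raise ValueError("pattern does not match at the start of the data")
    return m.end()
```

For data with no leading "a", matched_length("a*", "bcd") gives a length of zero: the match succeeds but consumes nothing.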

I’m not sure I’ve got all the cases here, but it’s something like this. 

That’s all for my brain on DFDL today…..

From: Tim Kimber [mailto:KIMBERT at uk.ibm.com] 
Sent: Thursday, June 02, 2011 4:41 PM
To: mbeckerle.dfdl at gmail.com
Cc: Steve Hanson
Subject: A selection of example data formats

Mike, 

Steve asked me to forward this text file that I have put together. I put 
it together as background material for our discussions about the parsing 
of DFDL elements and groups. 

Key issues: 
- The specification uses the terms 'empty', 'missing' and 'known not to 
exist' in reference to elements. We need to work out what these terms mean 
so that the spec can be made clearer. 
- In my opinion, the terms 'missing' and 'known not to exist' should not 
have different meanings - it invites criticism. If 'missing' means 
something different from 'known not to exist' then we need a different 
word or phrase. 
- The application of default values for missing required elements in the 
parser is problematic. I think Steve may have sent you an email about 
this, so I won't outline the issues here ( Steve, please can you forward 
your email to me ). 

Disclaimer : This set of data formats does not highlight all of the 
unresolved questions around the parsing of groups - only the ones that 
were in play at the time I produced the document. 

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742 
Internal tel. 246742

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Formats to consider.doc
Type: application/octet-stream
Size: 50688 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20110615/0fe5354d/attachment-0001.obj