[DFDL-WG] endOfData - was: RE: FW: MIke's notes from call on 2008-08-13

Wed Oct 29 08:31:08 CDT 2008

I agree that "end" or whatever we decide to call it should be reserved for 
the last object in a sequence. I prefer "endOfParent".

I have a general unease about the lengthKind enum "implicit". It 
originally meant something quite specific: the length was derived from the 
underlying xsd. That has now been extended for text decimals to mean 
derived from the length of the textNumberPattern, and for a sequence to 
mean derived from the length of its children. I think we are overloading 
it. I think that "implicit" should be reserved for simple elements only, 
with its current semantics, and that we should come up with a new enum, 
reserved for complex elements or sequences only. I suggest "children" 
(given I have also suggested "endOfParent") or maybe "content".

Regards

Steve Hanson
Programming Model Architect
WebSphere Message Brokers
Hursley, UK
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848

"Mike Beckerle" <mbeckerle.dfdl at gmail.com> 
28/10/2008 00:48
Please respond to
<mbeckerle.dfdl at gmail.com>

To
Steve Hanson/UK/IBM at IBMGB
cc
Alan Powell/UK/IBM at IBMGB
Subject
RE: endOfData - was: RE: FW: MIke's notes from call on 2008-08-13

Not sure where this leaves us. 

It is ok to reserve lengthKind="end" or "parent" or whatever for the last 
element of a sequence. 

...mike

Mike Beckerle | OGF DFDL WG Co-Chair | CTO | Oco, Inc.
Tel:  781-810-2100  | 504 Totten Pond Road, Waltham MA 02451 | 
mbeckerle.dfdl at gmail.com 

From: Steve Hanson [mailto:smh at uk.ibm.com] 
Sent: Wednesday, October 22, 2008 12:56 PM
To: mbeckerle.dfdl at gmail.com
Cc: Alan Powell
Subject: Re: endOfData - was: RE: FW: MIke's notes from call on 2008-08-13

Mike - sorry but I think users will find this baffling. 
lengthKind="implicit" was intended to mean that the logical xsd provided 
the length. lengthKind="delimited" means that markup provides the length. 
We are overloading the word "implicit" and we are wrong to do so. Trying 
to wrap these together, and include "endOfData" (as "parent") as well, is 
taking the abstraction too far. It is not how people view their data. 

Regards

Steve Hanson

"Mike Beckerle" <mbeckerle.dfdl at gmail.com> 
22/10/2008 15:29 

Please respond to
<mbeckerle.dfdl at gmail.com>

To
Alan Powell/UK/IBM at IBMGB, Steve Hanson/UK/IBM at IBMGB 
cc

Subject
endOfData - was: RE: FW: MIke's notes from call on 2008-08-13

First: apologies for missing the call today without notice. I've been 
solid on a rather urgent customer-related matter since before our meeting 
time and unable to break away. 

Now: w.r.t. end of data email from Steve. 

In the example you highlight, the reason both children of the sequence 
have lengthKind="endOfData" is that the parent is providing the way of 
determining the length, in this case using delimiters. Conceptually, the 
parser can carve out the box of data bytes for the first child by scanning 
for the separator, and the box for the 2nd child by scanning for the 
terminator. Then it can present those finite size boxes to the parser to 
parse each child, and each child consumes the entire box, i.e., to the end 
of (its box of) data. 
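
As a rough illustration of the box-carving idea above, here is a minimal 
Python sketch. It is not how any DFDL processor is specified to work. The 
function names carve_two_children and parse_string_child are hypothetical, 
and the sketch assumes single-string delimiters with no escaping. 

```python
def carve_two_children(data: str, separator: str = ",",
                       terminator: str = ";") -> tuple[str, str]:
    """Carve one finite box per child out of a two-child delimited sequence.

    Assumption: delimiters are plain strings and never appear escaped
    inside the data.
    """
    # Box for the first child: scan for the separator.
    sep_at = data.find(separator)
    if sep_at == -1:
        raise ValueError("separator not found")
    # Box for the second child: scan for the terminator after the separator.
    start2 = sep_at + len(separator)
    term_at = data.find(terminator, start2)
    if term_at == -1:
        raise ValueError("terminator not found")
    return data[:sep_at], data[start2:term_at]


def parse_string_child(box: str) -> str:
    """A string child simply consumes its entire box, to the end of its data."""
    return box


box1, box2 = carve_two_children("string1,string2;")
f1 = parse_string_child(box1)
f2 = parse_string_child(box2)
```

The point of the sketch is that the children never look at delimiters 
themselves: the parent hands each one a box that is already bounded. 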

However, I agree the notion of "endOfData" is confusing as I have just 
explained it above. 

Perhaps the right lengthKind for a child to have when the enclosing 
parent has a terminator or separator is 

     lengthKind="parent" 

which you can read conceptually as: 

     "length kind for this child is determined by something specified in 
the parent. So you'll find nothing here about length." 

We could then drop the whole "endOfData" concept entirely. 

So in the example, both children would still have lengthKind="parent". 

The implied "parent" of the top level is the real true "end of the data", 
so a top-level element could have lengthKind="parent" also. This is an 
important composition property. It allows you to take a well specified 
format and drop it in as the description of an MQ message payload, for 
example. 

Now, lengthKind="parent" is kind of the opposite of lengthKind="implicit". 
"parent" is top down, i.e., from the enclosing structure. "implicit" is 
bottom up, i.e., length implied by the contents of the element. 

Here's a trick that can make this all more palatable. For certain kinds of 
child elements, lengthKind="implicit" will behave as lengthKind="parent". 
This would happen for variable length children without any way of 
determining the variable length "bottom up". Examples of this are: 
variable length text strings, variable occurrences of anything (with no 
way to determine how many occurrences), or ordered sequences whose final 
element is a variable length child without any way of determining the 
variable length. (This definition is recursive intentionally.) 

Given this, I think the DFDL fragment could be: 

<complexType dfdl:lengthKind="implicit" 
             dfdl:representation="text" > <!-- these are in the scope --> 
.... 
<sequence dfdl:separator="," dfdl:terminator=";" 
          dfdl:lengthKind="delimited"> 
   <element name="f1" type="string" /> 
   <element name="f2" type="string" /> 
</sequence> 
.... 
</complexType> 

Which I claim is what we want to have to write to capture the simple thing 
this is trying to express, which is the format of "string1,string2;" after 
all. 

Comments? 

BTW: notice my use of an enclosing complexType and ellipsis in order to 
achieve the notion that certain property bindings surround the example. 
This is one of the reasons I think we don't need a full-up 2-level 
semantic model as Sandy suggested. I think examples like the above are 
sufficiently clear, particularly given the simplified scoping. 

From: Steve Hanson [mailto:smh at uk.ibm.com] 
Sent: Wednesday, October 22, 2008 9:09 AM
To: Mike Beckerle
Cc: Alan Powell
Subject: Re: FW: MIke's notes from call on 2008-08-13

Hi Mike 

I owe a review of the "EndOfData Semantics" discussion below. 

The only thing that looks slightly odd in the examples below is this: 

It doesn't seem right for f1 to have "endOfData". Should we have a rule 
that says "endOfData" is only allowed on the last object in a sequence? 
After all, that was its original purpose: a way for the last thing to say 
it is bounded by the end of its parent. 

Would "endOfParent" be better than "endOfData" ? 

Regards

Steve Hanson

"Mike Beckerle" <mbeckerle.dfdl at gmail.com> 
10/09/2008 14:10 

Please respond to
<mbeckerle.dfdl at gmail.com>

To
Steve Hanson/UK/IBM at IBMGB 
cc

Subject
FW: MIke's notes from call on 2008-08-13


From: Mike Beckerle [mailto:mbeckerle.dfdl at gmail.com] 
Sent: Friday, August 15, 2008 11:53 AM
To: dfdl-wg at ogf.com
Subject: MIke's notes from call on 2008-08-13 

Only Alan Powell and myself were on the call. 

These are my notes. 

TOPIC: Decimal Calendar - idea: should behave as if converting decimal to 
text, then text to date/time. I.e., use the same date/time pattern 
language, but a subset of it, since decimal can express nothing but digits. 

TOPIC: Notes to authors (at start of spec): add that we don't do scalar 
type coercions/conversions generally. I.e., if the representation is a 
floating point, then the logical must be a floating point. If the 
representation is decimal, the logical must be decimal. We don't allow you 
to have a logical int whose rep is decimal or vice versa. Rationale: it 
adds complexity that we can avoid. It doesn't provide anything you can't 
easily do another way (layering), etc. 

TOPIC: EndOfData Semantics: 

We discussed that currently we were overloading the delimited concept to 
include the end-of-data concept, and that was unsatisfactory and was 
resulting in attempts to reinject end-of-data as "end-of-bitstream" and 
the like. 

Points - 

- Distinguish delimited to mean we positively ARE scanning for a text 
  pattern delimiter, and do not confuse this with the end-of-data case, 
  which is fundamentally different. 
- Avoid a special-case keyword only for the "top level" end of the data 
  stream. This has really bad composition properties. 
- lengthKind="endOfData" applies to both binary and text representations. 
  For text it means there is no terminator for this element. The enclosing 
  construct's length, however determined (separator, terminator, fixed, 
  prefix, etc.) will bound the length of this contained element. 

Case:

<sequence dfdl:separator="," dfdl:terminator=";"> 
    <element name="f1" type="string" dfdl:lengthKind="endOfData"/> 
    <element name="f2" type="string" dfdl:lengthKind="endOfData"/> 
</sequence> 

The above seems ok to me. 

Case:

<sequence dfdl:lengthKind="prefixed" dfdl:representation="binary"> 
    <element name="f1" type="int" dfdl:length="4" 
             dfdl:lengthKind="explicit"/> 
    <element name="f2" type="hexBinary" dfdl:lengthKind="endOfData"/> 
</sequence> 

The above seems ok to me.
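
For the binary case, a minimal Python sketch of how the prefix bounds the 
last child might look like this. The notes do not say what the prefix looks 
like, so the 2-byte big-endian unsigned prefix (counting the bytes that 
follow it), the big-endian signed 4-byte int, and the function name 
parse_prefixed_sequence are all assumptions made only for illustration. 

```python
import struct


def parse_prefixed_sequence(data: bytes) -> tuple[int, bytes, bytes]:
    """Parse a length-prefixed sequence of (f1: 4-byte int, f2: the rest).

    Assumptions (not from the notes): the prefix is a 2-byte big-endian
    unsigned integer counting the bytes after the prefix, and f1 is a
    big-endian signed 32-bit integer.
    """
    (seq_len,) = struct.unpack_from(">H", data, 0)
    box = data[2:2 + seq_len]
    if len(box) < seq_len:
        raise ValueError("data shorter than the prefixed length")
    if seq_len < 4:
        raise ValueError("sequence too short to hold f1")
    # f1 has an explicit length of 4 bytes.
    (f1,) = struct.unpack_from(">i", box, 0)
    # f2 has no length of its own: it runs to the end of the parent's box.
    f2 = box[4:]
    # Anything after the box belongs to whatever follows the sequence.
    remainder = data[2 + seq_len:]
    return f1, f2, remainder


sample = b"\x00\x07" + b"\x00\x00\x00\x2a" + b"\xaa\xbb\xcc" + b"tail"
f1, f2, remainder = parse_prefixed_sequence(sample)
```

Here f2 stops at the end of the sequence's box, not at the end of the 
input, which is the nestable behaviour the points above ask for. 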

Important use cases: 

Case 1: binary element at the end of a top-level sequence. 

<schema ...> 
<element name="theTop"> 
    <complexType> 
        <sequence dfdl:lengthKind="implicit"> 
            <element name="f1" type="int" dfdl:length="4" 
                     dfdl:lengthKind="explicit"/> 
            <element name="f2" type="hexBinary" 
                     dfdl:lengthKind="endOfData"/> 
        </sequence> 
    </complexType> 
</element> 
</schema> 

In the above, the top level sequence has implicit length kind. This is ok, 
because the top level is assumed to be in an "end of data" context. 

Case 2: deeper nesting, same implicit-length sequence. 

<schema ...> 
<element name="NestedInside"> 
    <complexType> 
        <sequence dfdl:lengthKind="implicit"> 
            <element name="f1" type="int" dfdl:length="4" 
                     dfdl:lengthKind="explicit"/> 
            <element name="f2" type="hexBinary" 
                     dfdl:lengthKind="endOfData"/> 
        </sequence> 
    </complexType> 
</element> 

<element name="stillNotTheTop"> 
    <complexType> 
        <sequence dfdl:lengthKind="implicit"> 
            ... 
            <element ref="NestedInside"/> 
        </sequence> 
    </complexType> 
</element> 

<element name="hasFixedLength"> 
    <complexType> 
        <sequence dfdl:lengthKind="explicit" dfdl:length="100"> 
            ... 
            <element ref="stillNotTheTop"/> 
        </sequence> 
    </complexType> 
</element> 

... 
</schema> 

This case illustrates how the composition properties work for 
explicit/implicit lengths. 

The definition of how this works goes something like this. 

When the last element of a sequence is binary with lengthKind="endOfData" 
this implies that the enclosing sequence is: 
(a)   length kind explicit or prefixed or endOfData 
(b)   length kind implicit - in this case recursively this enclosing 
sequence must itself be enclosed in a sequence similarly constrained on 
length kind (cases a, b, c here) 
(c)   the top-level sequence 
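
One way to read the recursive rule above is sketched in Python below. This 
is only an interpretation of cases (a), (b) and (c): the Seq class and the 
function end_of_data_allowed are hypothetical names, and the sketch models 
nothing beyond the length kind of each enclosing sequence. 

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Seq:
    """A sequence, reduced to its length kind and its enclosing sequence."""
    length_kind: str
    parent: Optional["Seq"] = None


# Case (a): length kinds that bound their content on their own.
BOUNDING_KINDS = {"explicit", "prefixed", "endOfData"}


def end_of_data_allowed(seq: Seq) -> bool:
    """May the last child of `seq` use lengthKind="endOfData"?"""
    if seq.parent is None:
        return True  # case (c): the top-level sequence
    if seq.length_kind in BOUNDING_KINDS:
        return True  # case (a)
    if seq.length_kind == "implicit":
        # Case (b): defer to the enclosing sequence, recursively.
        return end_of_data_allowed(seq.parent)
    return False


# Case 2 from the notes: explicit(100) > implicit > implicit.
has_fixed_length = Seq("explicit")
still_not_the_top = Seq("implicit", parent=has_fixed_length)
nested_inside = Seq("implicit", parent=still_not_the_top)
case2_ok = end_of_data_allowed(nested_inside)
```

For the Case 2 nesting the check succeeds, because the chain of implicit 
sequences eventually reaches one with an explicit length. 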

Note: We need to revisit whether the name "endOfData" is desirable or not. 
There's a list of alternatives from the F2F meeting. Problem is that a 
naïve user will be thinking "top level" but the concept actually needs to 
be compositional/nestable. 

TOPIC: float/double: we concluded that, until XML has floating point types 
that can handle extended precisions, DFDL can't handle extended 
precisions in any reasonable way, so we should simply say DFDL v1.0 
supports only 64-bit and 32-bit floating point precision. This narrows 
down float types to IEEE (single and double), and IBM390 (single and 
double), and maybe AS400 if that's different and still within 64 bits 
precision. 


