[DFDL-WG] Fw: DFDL and the truncated SAP File IDoc format

Wed Jul 4 06:57:19 EDT 2012

Agreed on the last call that dfdl:lengthKind 'delimited' will not be 
changed. Specifically, it will not attempt to look for in-scope delimiters 
between child elements whose lengthKind is not 'delimited'.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 04/07/2012 11:53 -----

From:   Steve Hanson/UK/IBM
To:     dfdl-wg at ogf.org
Date:   11/06/2012 11:51
Subject:        Fw: DFDL and the truncated SAP File IDoc format

For next DFDL WG call. Some thoughts on whether lengthKind 'delimited' 
should be able to model this without resorting to asserts. Read from 
bottom.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 11/06/2012 11:50 -----

From:   Tim Kimber/UK/IBM
To:     Mike Beckerle <mbeckerle.dfdl at gmail.com>
Cc:     Steve Hanson/UK/IBM at IBMGB
Date:   30/05/2012 21:23
Subject:        Re: Fw: DFDL and the truncated SAP File IDoc format

Thanks Mike - useful input. I've added my comments in <tk> tags

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742 
Internal tel. 246742

From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     Steve Hanson/UK/IBM at IBMGB
Cc:     Tim Kimber/UK/IBM at IBMGB
Date:   30/05/2012 19:59
Subject:        Re: Fw: DFDL and the truncated SAP File IDoc format

Hmmm.

We discussed at one time whether there are actually 2 different delimiting 
schemes. One is what we have now. Let me call this "delimited1". In 
delimited1, an enclosing parent's delimiter cannot be used in isolation to 
find the extent of the data, because child elements might have escape 
schemes defined which escape even the parent delimiter, so you still have 
to use the recursive definition of the children when parsing.  This is a 
very powerful mode of parsing. However, many things that might be errors 
(such as putting a binary field in the middle of a bunch of text fields) 
would be tolerated by this regime, because scanning would be turned on and 
off appropriately.

I struggle, however, with whether delimited1 is really the same thing as 
"implicit". I mean if you define an element as 'implicit' but it has a 
terminator, then after you unwind from the recursion you are still going 
to then look for the terminator, so it's not like the delimiters are being 
ignored.
<tk>
I think of it this way. The lengthKind property is about the length of the 
*content* region. So 'delimited1' is, I think, the same as 'implicit' for 
the purposes of finding the length of the content region. If the complex 
element has a terminator then the terminator will be expected at the byte 
offset that immediately follows the end of the content region - whether 
lengthKind is 'delimited' or 'implicit'. In other words, I'm modifying 
your description of the behaviour to "after you unwind from the recursion 
you are still going to then look for the terminator at the byte offset 
immediately following the element's content"
</tk>
The other definition of delimited (let's call it delimited2) would be 
where you get to completely disregard the children when searching for the 
parent delimiter. Many things appearing within the children would be SDE. 
E.g., binary format children would be an SDE, etc. Delimited2 would imply 
that the children are all representation="text", and the scan for the 
parent delimiter would be irrespective of any delimiters and escape 
schemes being put in place by child elements. So for example, the last 
child inside a delimited2 parent could have length kind = "endOfData" just 
fine, because we can isolate the "box" of data first, and then parse the 
children within it, with the last child extending to the end of the "box". 

<tk>
You mean 'endOfParent', but it doesn't change your point, which is valid.
My concern with your description is the implication that the parser needs 
to scan the same data multiple times. Maybe there are ways to analyse the 
model and avoid that necessity for many types of model, but that may be 
easier said than done. 
My proposal was to respect the lengthKind of each child element within the 
parent delimited element, but to check for the terminator of the element, 
of its main group, and for any other enclosing terminating delimiters 
before continuing to parse any member of the group. I'm prepared to be 
convinced that this approach is shot full of logical inconsistencies, btw.
</tk>

...mike

On Wed, May 30, 2012 at 1:10 PM, Steve Hanson <smh at uk.ibm.com> wrote:
Hi Mike 

Interested in your opinion on this one...it was prompted by looking at the 
best way to model a format where each record consisted of fixed length 
optional fields 1 to n followed by an EOR indicator, where missing 
trailing fields are suppressed.  Kind of analogous to suppressing trailing 
delimiters for empty fields.   

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 
----- Forwarded by Steve Hanson/UK/IBM on 30/05/2012 18:05 ----- 

From:        Tim Kimber/UK/IBM 
To:        Steve Hanson/UK/IBM at IBMGB 
Date:        15/05/2012 12:27 
Subject:        Fw: DFDL and the truncated SAP File IDoc format 

I've thought about this a bit more... 

The already-existing rule about lengthKind=delimited versus 
lengthKind=implicit only applies when the parser is about to parse the 
content region of an element, and needs to decide whether to recurse into 
its content. If the element's own lengthKind is 'delimited' then it does 
not recurse. The rule that you are proposing goes further than that, and 
requires that lengthKind=delimited is taken literally; the length of the 
complex element truly is defined by the in-scope delimiters, including its 
own terminator. I like that rule, actually - it gives real meaning to 
lengthKind=delimited. The problem is defining the behaviour, because the 
rule has implications for the parsing of the element's group. Before 
parsing each member of the group ( required or not, I think) , the parser 
must check for in-scope delimiters. This only needs to happen if the 
immediate parent of the group is an element with lengthKind=delimited or 
endOfParent. I'm sure there are edge cases around this ( what about 
embedded groups ) so we should discuss this with Mike. 
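
The rule discussed above (check for in-scope delimiters before parsing 
each member of the group) might look roughly like this in Python. This is 
a hedged illustration, not specified DFDL behaviour: the function name, 
its parameters and the (name, length, required) field tuples are 
hypothetical, every child is assumed to have an explicit length, and 
escape schemes and embedded groups are not modelled.

```python
def parse_with_delimiter_check(data, start, fields, terminator, enclosing=()):
    """Parse explicit-length children, checking delimiters before each one.

    terminator is the element's own terminator; enclosing holds any other
    in-scope terminating delimiters from enclosing constructs.
    """
    in_scope = (terminator, *enclosing)
    pos, out = start, {}
    for name, length, required in fields:
        # Check for in-scope delimiters before every member, required or not.
        if any(data.startswith(d, pos) for d in in_scope):
            if required:
                raise ValueError(f"delimiter found before required field {name}")
            break  # remaining optional members are treated as absent
        if pos + length > len(data):
            raise ValueError(f"not enough data for field {name}")
        # The child's own lengthKind is respected: take exactly `length`.
        out[name] = data[pos:pos + length]
        pos += length
    if not data.startswith(terminator, pos):
        raise ValueError(f"terminator not found at offset {pos}")
    return out, pos + len(terminator)
```

Note that a field whose content happens to begin with a delimiter is 
treated as the end of the element, which is one of the edge cases that 
would need discussion.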

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742  
Internal tel. 246742

----- Forwarded by Tim Kimber/UK/IBM on 15/05/2012 12:12 ----- 

From:        Tim Kimber/UK/IBM 
To:        Steve Hanson/UK/IBM at IBMGB 
Date:        15/05/2012 11:19 
Subject:        Re: DFDL and the truncated SAP File IDoc format 

When a modeller sets lengthKind to 'delimited' they are implicitly 
claiming that the element's content region will not contain any of the 
in-scope delimiters ( unless they are escaped ). That makes it safe for 
the parser to look for *all* in-scope delimiters when scanning. When they 
set lengthKind='explicit' they are not making any such claim. 
Well...nearly.  We already have a rule in DFDL that distinguishes between 
a strict behaviour when lengthKind='implicit' and a lax-but-more-efficient 
behaviour when lengthKind='delimited'. I think that may be the 
justification for your rule. 

This has prompted me to think about how we discuss this delimited/implicit 
distinction in the DFDL specification. I think it might be useful to cast 
the discussion in terms of what is allowed in the content of the element. 
If the parser might encounter the already-in-scope delimiters as part of 
its content ( either within explicit-length fields or as the delimiters of 
child elements/groups )  then lengthKind must be 'implicit'. If the parser 
can safely assume that delimiters never occur within the element's 
content, or that they are always escaped, then lengthKind='delimited' is 
the better choice. 

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742  
Internal tel. 246742

From:        Steve Hanson/UK/IBM 
To:        Tim Kimber/UK/IBM 
Date:        15/05/2012 09:20 
Subject:        DFDL and the truncated SAP File IDoc format 

Hi Tim 

Looking at Emma's format got me thinking about erratum 3.3. 

3.3. Section 12.3. Clarify that when property is lengthKind 'explicit', 
'implicit' (simple only), 'prefixed' or 'pattern', it means that delimiter 
scanning is turned off and in-scope delimiters are not looked for within 
or between elements. 

I am absolutely clear on why the parser would not want to look for 
in-scope delimiters within such elements. I'm also happy not to look for 
delimiters between elements if the element is required. But why shouldn't 
the parser look between elements when the element is optional?  Or at 
least when the remaining content is all optional?  There's an analogy here 
with trailing separator suppression, that I don't think we spotted before. 
Were we worried that users would be using unescaped characters because 
the data is fixed length? 

If my format was some required fixed length fields followed by some 
optional fixed length fields, with an indicator for end of record, I would 
like to be able to model it very simply, as follows.   

<xs:element name="record" dfdl:lengthKind="delimited"
            dfdl:terminator="%LF;">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="A" type="xs:string"
                  dfdl:lengthKind="explicit" dfdl:length="10" />
      <xs:element name="B" type="xs:string"
                  dfdl:lengthKind="explicit" dfdl:length="10" />
      <xs:element name="C" type="xs:string"
                  dfdl:lengthKind="explicit" dfdl:length="10"
                  minOccurs="0" />
      <xs:element name="D" type="xs:string"
                  dfdl:lengthKind="explicit" dfdl:length="10"
                  minOccurs="0" />
      <xs:element name="E" type="xs:string"
                  dfdl:lengthKind="explicit" dfdl:length="10"
                  minOccurs="0" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

If DFDL doesn't allow this it means I need either 
dfdl:lengthKind="pattern" on the record element, or I need an assert on 
each element checking the content is not line feed. 
You can argue that using 'pattern' instead of 'delimited' is no big deal, 
but using 'delimited' is a more natural fit and what a modeler would think 
of first. 
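
To make the problem concrete, here is a minimal Python sketch (not DFDL 
processor code) of how the record above would be parsed if in-scope 
delimiters are never looked for between the explicit-length children, as 
erratum 3.3 describes. The function name, the FIELDS table and the sample 
data are illustrative assumptions. The sketch also assumes that an 
optional field is treated as absent only when the data runs out, and it 
does not model backtracking.

```python
# (name, length, required) for the children of "record" in the schema above.
FIELDS = [("A", 10, True), ("B", 10, True), ("C", 10, False),
          ("D", 10, False), ("E", 10, False)]
TERMINATOR = "\n"  # stands in for dfdl:terminator="%LF;"


def parse_record_no_scanning(data):
    """Parse one record with delimiter scanning off between children.

    Each explicit-length field blindly takes its next 10 characters.
    """
    pos, out = 0, {}
    for name, length, required in FIELDS:
        chunk = data[pos:pos + length]
        if len(chunk) < length:
            if required:
                raise ValueError(f"not enough data for required field {name}")
            break  # assumption: optional field is absent if data runs out
        out[name] = chunk
        pos += length
    # The terminator is only looked for after the last parsed child.
    if not data.startswith(TERMINATOR, pos):
        raise ValueError(f"terminator not found at offset {pos}")
    return out, pos + len(TERMINATOR)


# A truncated record (A and B only) followed by a second record.
two_records = "A" * 10 + "B" * 10 + "\n" + "a" * 10 + "b" * 10 + "c" * 10 + "\n"
```

Calling parse_record_no_scanning(two_records) raises ValueError: field C 
swallows the line feed together with the start of the next record, and 
the failure only surfaces when no terminator is found after E. A 
truncated record at the very end of the data happens to parse, but only 
because the data runs out.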

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

-- 
Mike Beckerle | OGF DFDL WG Co-Chair 
Tel:  781-330-0412