[DFDL-WG] Fw: Action 242 - valueLength and contentLength function wording

Tue Apr 15 10:46:15 EDT 2014

I had a quick look at the document and Steve's review comments.

The text tries to define valueLength() in terms of the infoset value, or 
states that it is 'calculated from' the infoset value. I believe this will 
mislead many readers (it already has, in fact). People just assume 
that the 'value length' is 'the length of the logical value' and they 
don't think any more deeply about it. So it is absolutely essential to 
state clearly that this function returns the length of a region in the 
data stream.

It may be simpler to define the behaviour in terms of the length of the 
SimpleContent or ComplexContent region, minus any padding.

The other point is around the bytes/characters issue. Mike's words, or 
something similar, are definitely required because we don't mandate that 
the encoding must be consistent throughout a complex type. Nor do we 
prohibit a mixture of character and non-character content in a complex 
type. And someone might even call valueLength() with lengthUnits 
'characters' on a complex type that contains no character data at all.

In principle it would be possible for the value to be available while 
parsing (for elements that follow the specified element). If we are 
disallowing this then it should be stated very clearly somewhere, and it 
probably already is.

regards,

Tim Kimber, 
IBM Integration Bus Development (Industry Packs)
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742 
Internal tel. 37246742

----- Forwarded by Tim Kimber/UK/IBM on 15/04/2014 15:28 -----

From:   Steve Hanson/UK/IBM at IBMGB
To:     Mike Beckerle <mbeckerle.dfdl at gmail.com>, 
Cc:     "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, dfdl-wg-bounces at ogf.org
Date:   15/04/2014 15:01
Subject:        Re: [DFDL-WG] Action 242 - valueLength and contentLength function wording
Sent by:        dfdl-wg-bounces at ogf.org

Review comments added: 

Regards

Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 

From:        Mike Beckerle <mbeckerle.dfdl at gmail.com> 
To:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, 
Date:        11/04/2014 14:04 
Subject:        Re: [DFDL-WG] Action 242 - valueLength and contentLength function wording
Sent by:        dfdl-wg-bounces at ogf.org 

Revised Action 242 proposed changes word doc attached. I have incorporated 
the discussion in this thread (I hope.) Please review. 

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com 
Please note: Contributions to the DFDL Workgroup's email discussions are 
subject to the OGF Intellectual Property Policy 

On Tue, Mar 25, 2014 at 10:56 AM, Mike Beckerle <mbeckerle.dfdl at gmail.com> 
wrote: 

This language is consistent with what we say for lengthKind pattern in 
section 12.3.5: 

"When unparsing, the dfdl:valueLength of a complex type element when the 
length units is 'characters' is computed as if the entire structure was 
unparsed into a temporary data stream beginning at position 1, and then 
this data stream is considered to be text in the character set encoding 
specified by the dfdl:encoding property, regardless of the actual 
representation of the complex type element or the elements contained 
within it. The number of characters in this temporary data stream is the 
value length of the complex type."

The behavior of the IBM DFDL implementation for valueLength, as 
described, is consistent with the above, except that it will not detect 
a decode error, and it gives an SDE (?) if the encoding is not fixed 
width. 

Since we have decided not to require that a complex type element is 
recursively all text all the way down, I believe we have to tolerate 
implementations having different behaviors in the potentially meaningless 
cases where there is binary data or encoding changes in the complex type. 
So I would add to the above suggested language this:

"However, if creation of this data stream would cause an encoding error, 
or parsing of this data stream as characters would cause a decoding error, 
then the behavior and return value of dfdl:valueLength are implementation 
dependent."
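
A minimal Python sketch of the proposed wording, under stated assumptions: 
the temporary data stream is already available as bytes, Python's strict 
decoder stands in for the DFDL decoding step, and characters are counted as 
Unicode code points. The function name and the None return for the 
implementation-dependent case are hypothetical, not taken from the spec.

```python
def value_length_in_characters(temp_stream: bytes, encoding: str):
    """Count the characters in a temporary unparsed data stream.

    Returns None when the bytes do not decode in `encoding`. The proposed
    wording leaves that case implementation dependent, so None is only a
    placeholder chosen for this sketch.
    """
    try:
        return len(temp_stream.decode(encoding, errors="strict"))
    except UnicodeDecodeError:
        return None


# Text-only content decodes cleanly: 5 characters held in 6 UTF-8 bytes.
text_only = "caf\u00e9s".encode("utf-8")
print(value_length_in_characters(text_only, "utf-8"))

# A binary child (a hypothetical 2-byte field 0xFF 0xFE) breaks the decode.
mixed = b"AB" + bytes([0xFF, 0xFE])
print(value_length_in_characters(mixed, "utf-8"))
```

The second call shows the "potentially meaningless" case: the stream mixes 
text and binary data, so the character count is not well defined.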

Looking at the DFDL spec, I am concerned that we never really say what we 
mean by the "length of the ComplexContent region." (Last sentence before 
Table 7 in section 12.3.7) Section 12.3.7.3 doesn't do it. The 
dfdl:valueLength function may be the first place where we have to actually 
say how the various sub-regions contribute to the ComplexContent region's 
length. 

I believe this is the obvious "sum of length of all contained regions", 
but keep in mind that alignment region lengths will vary depending on the 
starting alignment, so the length is, in general, dependent on the 
position within the bit stream.

Hence when unparsing we have to specify that the dfdl:valueLength is 
measured as if the ComplexContent region started at position 1 (as I did 
above) so that internal alignment regions can be given meaningful lengths. 

The general clarification should be added to 12.3.7.3, or to section 
12.3.7 immediately before section 12.3.7.1. Something like this:

"The length of the ComplexContent region is the sum of the lengths of the 
contained regions. However, note that alignment regions inside the 
ComplexContent may be of different lengths depending on the 
ComplexContent's starting position alignment." 
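
To illustrate why the starting position matters, here is a small Python 
sketch. It assumes each child is described only by an alignment and a 
length in bits, and that positions are 1-based as in the wording above. The 
helper names and the two-child layout are hypothetical.

```python
def fill_bits(position: int, alignment: int) -> int:
    """Bits of alignment fill needed before a region that must start on a
    multiple of `alignment`, given a 1-based bit `position`."""
    offset = (position - 1) % alignment
    return 0 if offset == 0 else alignment - offset


def complex_content_length(start: int, children) -> int:
    """Sum of child lengths plus alignment fill, from 1-based bit `start`.

    `children` is a list of (alignment_in_bits, length_in_bits) pairs.
    """
    position = start
    for alignment, length in children:
        position += fill_bits(position, alignment) + length
    return position - start


# One byte-aligned 8-bit child followed by one 32-bit-aligned 32-bit child.
children = [(8, 8), (32, 32)]
print(complex_content_length(1, children))  # 64 bits: 24 bits of fill
print(complex_content_length(9, children))  # 56 bits: 16 bits of fill
```

The same children give different totals at different starting positions, 
which is why the unparsing rule fixes the start at position 1.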

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com 

On Mon, Mar 24, 2014 at 11:34 AM, Andrew Edwards <andy.edwards at uk.ibm.com> 
wrote: 
Steve (et al) - Resending as the last one bounced. 

I'll usurp Tim and respond :) 

Currently the IBM implementation insists on using a fixed-width encoding 
and returns an "unsupported" error message for a variable-width encoding. 
With a fixed-width encoding, we "do the maths" using the 
bytes-per-character and the bytes written by this complex element. 
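
The "do the maths" approach can be sketched in Python as follows. This is 
not the IBM code: the table of encodings, the function name, and the choice 
of exceptions are all assumptions made for illustration.

```python
# Bytes per character for a few fixed-width encodings (illustrative subset).
FIXED_WIDTH_BYTES = {"us-ascii": 1, "iso-8859-1": 1, "utf-32be": 4}


def length_in_characters(bytes_written: int, encoding: str) -> int:
    """Derive a character count from a byte count without decoding."""
    width = FIXED_WIDTH_BYTES.get(encoding.lower())
    if width is None:
        # Mirrors the "unsupported" error for variable-width encodings.
        raise NotImplementedError(f"unsupported: {encoding} is not fixed width")
    chars, remainder = divmod(bytes_written, width)
    if remainder:
        # Assumption of this sketch: a partial character is treated as an error.
        raise ValueError("byte count is not a whole number of characters")
    return chars


print(length_in_characters(12, "UTF-32BE"))  # 12 bytes / 4 bytes per character
```

Because nothing is decoded, this calculation cannot notice bytes that are 
invalid in the encoding.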

HTH, 
Andy 
Andy Edwards - IBM Integration Bus - DFDL 

Email: andy.edwards at uk.ibm.com 
Snail Mail: MP211, Hursley park, Hursley, WINCHESTER, Hants, SO21 2JN 
Tel int: 247222 
Tel ext: +44 (0)1962 817222 
Desk: DE3 V17

The Feynman problem solving Algorithm
1) Write down the problem
2) Think real hard
3) Write down the answer
-- Murray Gell-Mann in the NY Times

From:        Steve Hanson/UK/IBM 
To:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, 
Cc:        Mike Beckerle <mbeckerle.dfdl at gmail.com>, Andrew Edwards/UK/IBM at IBMGB 
Date:        24/03/2014 14:52 
Subject:        Re: [DFDL-WG] Action 242 - valueLength and contentLength function wording

Note errata 3.9, my bolding: 

"3.9. Section 12.3.5, 7.3.1, 7.3.2.  The spec originally allows lengthKind 
‘pattern’ to be used when the representation of the current element, or of 
a child element, is binary, but imposes restrictions on the encoding that 
can be in force. 

Clarify that the encoding property must be defined for the element (else 
schema definition error), and that a decoding processing error is possible 
if the match of the regex encounters data that does not decode in that 
encoding, dependent on the setting of encodingErrorPolicy. Remove section 
12.3.5.1. 

Same clarifications needed for testKind ‘pattern’ property for asserts and 
discriminators. 

For consistency, the restriction that a complex element of specified 
length and lengthUnits ‘characters’ must have children that are all text 
and that have the same encoding as the complex element, is dropped." 

That's the restriction that I was referring to in my comment below.  I can 
see why it was dropped - basically the parser now just tries to decode n 
characters using the complex element's encoding (and encodingErrorPolicy). 
We could apply the same principle for dfdl:valueLength & 
dfdl:contentLength - you build the stream from the bottom up, and then 
decode it using the complex element's encoding (and encodingErrorPolicy ?) 
to get the length in characters. 
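
The parser-side behaviour described above (decode n characters using the 
complex element's encoding) can be sketched in Python with an incremental 
decoder. The function name is hypothetical, and strict decoding is used 
here as a stand-in for one possible encodingErrorPolicy setting.

```python
import codecs


def bytes_consumed_by_characters(data: bytes, n_chars: int, encoding: str) -> int:
    """Return how many leading bytes of `data` decode to `n_chars` characters.

    Raises UnicodeDecodeError on undecodable data, and ValueError if the
    data runs out before enough characters have been decoded.
    """
    if n_chars == 0:
        return 0
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    decoded = 0
    for index in range(len(data)):
        # Feed one byte at a time; the decoder buffers incomplete characters.
        decoded += len(decoder.decode(data[index:index + 1]))
        if decoded >= n_chars:
            return index + 1
    raise ValueError("data ended before enough characters were decoded")


# "a", the euro sign (3 bytes in UTF-8), then "b": 2 characters span 4 bytes.
stream = "a\u20acb".encode("utf-8")
print(bytes_consumed_by_characters(stream, 2, "utf-8"))
```

With a variable-width encoding the byte length of n characters is only 
known after decoding, which is the general case referred to in this thread.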

Note that's how unparsing for lengthKind 'prefixed' with lengthUnits 
'characters' would work as well - the spec just says "For a complex 
element, the length is that of the ComplexContent region" which is not 
sufficient (12.3.4). Similar deal for lengthKind 'explicit' - in order to 
know the size in chars of ElementUnused the unparser needs to know the 
size in chars of the data first (12.3.7.3). 

(Of course, for a fixed width encoding, you don't need to decode, you can 
just do the maths, but for the general case you need to decode. Also just 
doing the maths does not take encodingErrorPolicy into account). 

Regards

Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 

From:        Steve Hanson/UK/IBM 
To:        Mike Beckerle <mbeckerle.dfdl at gmail.com>, 
Cc:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, dfdl-wg-bounces at ogf.org 
Date:        24/03/2014 12:55 
Subject:        Re: [DFDL-WG] Action 242 - valueLength and contentLength function wording

Mike 

23.5.3.1. Value length is only a function of the dfdl:encoding property if 
the element has a text representation. Not sure this needs to be 
(re)stated here. 

23.5.3.1. "The value length is computed from the DFDL infoset value, 
ignoring the dfdl:length or dfdl:textOutputMinLength property. Other DFDL 
properties which affect the length of a text or binary representation are 
respected, it is only an explicit length which is ignored." Last sentence 
is too imprecise - should be phrased in terms of the grammar. 

23.5.3.1. "If the second argument is 'characters' then the element must 
have text representation and it is a schema definition error otherwise". 
Yes but only for a simple type, so should be qualified. 

23.5.3.1. "If the second argument, giving the length units, is 
'characters', then recursively, this complex type element must have text 
representation throughout all its contained elements and framing, all of 
which must also use a uniform character set encoding."  I can't see that 
restriction elsewhere in the spec when it talks about length of 
ComplexContent and lengthUnits 'characters' - I was expecting it to be in 
section 12.3.4 or 12.3.7.3 which face the same issue - but it isn't. Did 
we decide not to have this restriction? Without such a restriction, how 
does the unparser come up with a meaningful length (unless it re-parses)? 
(Tim - what does IBM DFDL do here?)  What about delimiters and padding of 
children that use %#r entities? 

23.5.3.2. The points in 23.5.3.1 about escape characters, length as a 
function of encoding, and bottom up for complex elements, apply equally to 
23.5.3.2.  It might be easier just to say in 23.5.3.2 that 
dfdl:contentLength for complex elements is the same as dfdl:valueLength, 
and for simple elements differs only by the additional inclusion of the 
LeftPadding and RightPadOrFill regions. 
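
The suggested relationship for simple elements can be written down as a 
short Python sketch. The class and field names are hypothetical; only 
LeftPadding and RightPadOrFill are region names taken from this thread, 
and all lengths are assumed to be in the same units.

```python
from dataclasses import dataclass


@dataclass
class SimpleElementRegions:
    """Region lengths of one simple element, all in the same length units."""
    left_padding: int        # LeftPadding region
    value: int               # the unpadded value region (hypothetical name)
    right_pad_or_fill: int   # RightPadOrFill region


def value_length(regions: SimpleElementRegions) -> int:
    """Length of the value region only, excluding padding."""
    return regions.value


def content_length(regions: SimpleElementRegions) -> int:
    """Value length plus the two padding regions."""
    return regions.left_padding + regions.value + regions.right_pad_or_fill


# A 2-character value padded with 2 characters on the left and 3 on the right.
padded = SimpleElementRegions(left_padding=2, value=2, right_pad_or_fill=3)
print(value_length(padded), content_length(padded))
```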

Also noted in passing: 

Specified length - An item has specified length when dfdl:lengthKind is 
"implicit", "explicit", or "prefixed".   

should be 

Specified length - An element has specified length when dfdl:lengthKind is 
"implicit" (simple type only), "explicit", or "prefixed". 

Regards

Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 

From:        Mike Beckerle <mbeckerle.dfdl at gmail.com> 
To:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, 
Date:        20/03/2014 17:21 
Subject:        [DFDL-WG] Action 242 - valueLength and contentLength function wording
Sent by:        dfdl-wg-bounces at ogf.org 

See attached doc which is proposed revisions to section 23.5.3

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com 
[attachment "Action-252-DFDL-Functions-23.5.3.docx" deleted by Andrew 
Edwards/UK/IBM] --
dfdl-wg mailing list
dfdl-wg at ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg 

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

[attachment "Action-252-DFDL-Functions-23.5.3.docx" deleted by Steve 
Hanson/UK/IBM]

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Action-252-DFDL-Functions-23.5.3.docx
Type: application/octet-stream
Size: 37740 bytes
Desc: not available
URL: <http://www.ogf.org/pipermail/dfdl-wg/attachments/20140415/b0da8cc2/attachment-0001.obj>