[DFDL-WG] Fw: Action 261

Wed Jun 25 06:41:44 EDT 2014

261
Implied separatorSuppressionPolicy for occursCountKind 'expression' (All)
10/6: Spec says it is 'never' (positional sequence) but you have to parse 
to identify the position, so isn't that non-positional?
17/6: Some other issues noted around 'expression' as per email thread. IBM 
have discussed this internally and will submit a proposal.

As was noted in the email for Action 260, if it is decided that the 
meaning of "Each occurrence in the sequence can be identified by its 
position in the data" is, more strictly, that an observer of the raw data 
can identify an occurrence of an element in the sequence solely by 
counting separators, then that would appear to make dfdl:occursCountKind 
'expression' more like 'parsed', and not eligible to be in a Positional 
sequence. But if the meaning is that a parser does not have to speculate 
to identify an occurrence of an element in the sequence, then it can be in 
a Positional sequence.

While discussing the nature of 'expression' it was noted that it is very 
easy for a DFDL user to create a data stream from an infoset such that the 
data stream cannot be parsed. If dfdl:outputValueCalc is not used, then 
the element(s) in the infoset that the expression refers to must be set 
manually to match the number of occurrences. The same observation applies 
to dfdl:lengthKind 'explicit' where dfdl:length is an expression. 

To address this, IBM proposes the following changes to the DFDL 
specification for occursCountKind 'expression' when unparsing:
- When all occurrences have been obtained from the infoset (and defaulting 
  applied if needed), the occursCount expression is evaluated.
- If any element that is referenced by the expression has 
  dfdl:outputValueCalc, then that expression is (re-)evaluated as part of 
  the above; given that the number of occurrences is now known, the 
  outputValueCalc expression should now evaluate successfully.
- If the result of the occursCount does not match the number of 
  occurrences, it is a processing error.
This ensures the integrity of the data stream (the non-outputValueCalc use 
case), while still allowing the infoset to dictate the count (the 
outputValueCalc use case). 
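
For illustration only, the proposed check can be sketched in Python as 
below. This is not DFDL and not any implementation's API: the dict keys 
"items" and "count", the ProcessingError class and the function name are 
all hypothetical, and the occursCount expression is assumed to be a plain 
reference to the count element.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


def resolve_occurs_count(infoset, output_value_calc=None):
    """Return the count to serialise, or raise if it contradicts the array.

    infoset: dict with an "items" list and, optionally, a manually set
        "count" (both keys are illustrative, not DFDL names).
    output_value_calc: optional callable playing the role of a
        dfdl:outputValueCalc expression on the count element.
    """
    occurrences = infoset["items"]            # all occurrences are now known
    if output_value_calc is not None:
        count = output_value_calc(infoset)    # (re-)evaluated, array complete
    else:
        count = infoset["count"]              # value was set manually
    # Assumption: the occursCount expression just references the count element.
    if count != len(occurrences):
        raise ProcessingError(
            f"occursCount gave {count}, infoset has {len(occurrences)}")
    return count
```

With a manually set count the function only verifies it; with a callable 
such as lambda i: len(i["items"]) the infoset dictates the count and the 
check passes by construction.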

Similarly, IBM proposes the following changes to the DFDL specification 
for lengthKind 'explicit', where length is an expression, when unparsing:
- When an element has been obtained from the infoset (and defaulting 
  applied if needed) and unparsed, but before padding is applied, the 
  expression is evaluated to give a length.
- If any element that is referenced by the expression has 
  dfdl:outputValueCalc, then that expression is (re-)evaluated; given that 
  the unpadded length is now known, the outputValueCalc expression should 
  now evaluate successfully.
- Now that we have a length, the unparser behaviour is the same as if the 
  length was obtained from a fixed dfdl:length value (that is, truncation, 
  padding, filling or error depending on other property settings).
This ensures the integrity of the data stream (the non-outputValueCalc use 
case), while still allowing the infoset to dictate the length (the 
outputValueCalc use case). Note that this avoids the 'awkward' behaviour 
whereby lengthKind 'explicit' (expression) is a specified length when 
parsing but variable length when unparsing; it is now always specified 
length.
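
As a rough sketch of the final step, once the expression has produced a 
length the value is handled like any fixed-length field. The Python below 
is illustrative only: fit_to_length, its pad_char and allow_truncation 
parameters and ProcessingError are hypothetical simplifications of the 
relevant DFDL property settings, lengths are counted in characters, and 
padding is always added on the right.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


def fit_to_length(unpadded, length, pad_char=" ", allow_truncation=False):
    """Fit an unparsed, unpadded text value to a known target length."""
    if len(unpadded) < length:
        # Too short: pad on the right up to the target length.
        return unpadded + pad_char * (length - len(unpadded))
    if len(unpadded) > length:
        # Too long: truncate only if permitted, otherwise report an error.
        if not allow_truncation:
            raise ProcessingError(
                f"value needs {len(unpadded)} characters, length is {length}")
        return unpadded[:length]
    return unpadded  # already exactly the target length
```

The same function serves whether the length came from a constant or from 
an evaluated expression, which is the point of the proposal.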

IBM believes that none of the above should impose additional burden on 
implementers, as no brand new behaviour is being added. 

Regards

Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 25/06/2014 10:24 -----

From:   Tim Kimber/UK/IBM
To:     Steve Hanson/UK/IBM at IBMGB, Alex Wood1/UK/IBM at IBMGB, Andrew 
Edwards/UK/IBM at IBMGB, Mark Frost/UK/IBM at IBMGB, 
Date:   16/06/2014 22:08
Subject:        Re: [DFDL-WG] Action 261

I think this needs to be discussed before the meeting tomorrow. I wanted 
to avoid turning this thread into a monster - but I don't think it's 
possible to discuss without the context so I've continued the thread as 
before with <tk> tags. 

I'm clear in my own mind that I understand the issues now.  We should give 
priority to the separatorSuppressionPolicy question because the current 
rules are close to being unimplementable. The occursCountKind and 
lengthKind questions are important but are at least implementable as they 
stand.

regards,

Tim Kimber, 
IBM Integration Bus Development (Industry Packs)
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742 
Internal tel. 37246742

From:   Steve Hanson/UK/IBM
To:     Tim Kimber/UK/IBM at IBMGB, 
Cc:     dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org
Date:   11/06/2014 16:12
Subject:        Re: [DFDL-WG] Action 261

Replies in <smh> tags

Regards

Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848

From:   Tim Kimber/UK/IBM at IBMGB
To:     dfdl-wg at ogf.org, 
Date:   11/06/2014 13:58
Subject:        Re: [DFDL-WG] Action 261
Sent by:        dfdl-wg-bounces at ogf.org

comments in <tk> tags 
regards,

Tim Kimber, 
IBM Integration Bus Development (Industry Packs)
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742 
Internal tel. 37246742

From:        Steve Hanson/UK/IBM 
To:        Tim Kimber/UK/IBM at IBMGB, 
Cc:        dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org 
Date:        11/06/2014 10:47 
Subject:        Re: [DFDL-WG] Action 261 

Some thoughts on this... 

I agree that the definition of positional sequence in the spec needs 
tightening as it is ambiguous as it stands and could be interpreted as a) 
or b).  If we adopted b) then that would appear to allow 'expression' to 
appear in a positional sequence, but wouldn't it also allow 'stopValue'? 
<tk>Yes - according to definition b) stopValue would be allowable in a 
positional sequence. We could still disallow it if we do not believe there 
is any benefit in allowing it. I don't believe it introduces any 
particular complexities for an implementer.</tk> 

occursCountKind 'expression' is analogous to lengthKind 'explicit' with an 
expression and to lengthKind 'prefixed'. Both these lengthKinds are 
classified as 'specified length' when parsing but 'variable length' when 
unparsing. We are observing that occursCountKind 'expression' is like 
'fixed' when parsing but not quite so like 'fixed' when unparsing - which 
is why section 16 groups 'expression' with 'parsed' for unparsing. 
<tk>Yes - we took a decision that the unparser should ignore the 
expression in lengthKind/occursCountKind, and just output whatever data 
happens to be in the info set. I'm not sure that it saves a lot of effort 
in the implementation and it certainly is not easy to justify as a 
consistent behaviour. For me, the unparser should treat 
lengthKind='explicit' the same way whether the value is static or 
calculated. And the unparser should treat occursCountKind='expression' the 
same way as occursCountKind='fixed'. </tk> 

When unparsing occursCountKind 'expression' you don't always have the 
calculated array length N. If the infoset was derived from XML, there is 
likely no 'count' element, just a bunch of elements with the same name 
that make up the 'array'. DFDL gives you the choice whether to a) manually 
set the count element, or b) have the unparser set it automatically via 
outputValueCalc. In the former case, you can create a document that can 
not be parsed; the unparser could check the 'count' element matches the 
infoset, but that would involve reverse engineering an arbitrarily complex 
expression and is why the specification does not say that. 
<tk>It would involve evaluating the expression. In most cases, that will 
not require any lookahead because the Length/Count field will precede the 
array or element. Not sure where the reverse engineering comes in?</tk> 
<smh>I see what you are saying. Just evaluate the expression and see what 
it gives for N. That handles case a) but not b) where I explicitly want 
the unparser to set the count via outputValueCalc - which is presumably 
referring to the number of elements in the array, which is not known. For 
case b) N has to be the number in the infoset. Given that we have to 
support case b) the unparser can not treat occursCountKind 'expression' 
exactly the same as 'fixed' when unparsing.</smh>

<smh>Similarly with lengthKind 'explicit' with an expression. For the 
equivalent to case a) the length is known which makes the length fixed, 
but for the equivalent of case b) with outputValueCalc the length is not 
known so it is variable. When this was discussed in the past, it was 
decided not to bifurcate the expression scenario. Hence the spec is the 
way it is. </smh>
<tk>
That helps. So your belief is that case a) is workable but case b) is not 
because the number of elements in the array is not known. I don't think 
that holds up under scrutiny.
In case b), the outputValueCalc expression cannot be evaluated until all 
of the array has been received. So if the info set is received as an event 
stream then the unparser must wait until the array completes before 
evaluating the expression. Note that the 'count' ( or 'length' ) field 
cannot be serialized until its value is known. But the array ( or 
variable-length field ) comes *after* the count/length field. By the time 
unparsing of the array/field begins, the value *is* known.

The implementation of case a) is actually less straightforward. If the 
field is an array length and the value is less than the number of items in 
the info set then I think the unparser must issue an error. If greater 
then the unparser could output default values ( if available ) or 
delimiters ( if the parent sequence is a positional sequence with a 
delimiter ) or else an error if neither are possible. More simply, the 
unparser could simply insist that the value must correctly describe the 
data, and I think that's a reasonable rule.
Similarly for the length. If the length of the *unparsed* value is greater 
than the value in the info set then the unparser should issue an error. If 
it is shorter and the field is simple then pad characters could be added. 
But again, I think real-world usage of length counts suggests that padding 
is unlikely to be wanted, and the unparser should simply insist on the 
length field being correct.
</tk>
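
The ordering argument for case b) can be sketched as follows. This is a 
minimal Python illustration, not a description of any DFDL processor: the 
event names "item" and "end_array", the function name and the 
comma-separated output format are all assumptions made for the example.

```python
def unparse_event_stream(events, separator=","):
    """Serialise a count field followed by the array it describes.

    events: iterable of (kind, value) pairs, where kind is "item" for an
        array occurrence or "end_array" once the array is complete.
    """
    buffered = []
    for kind, value in events:
        if kind == "item":
            buffered.append(value)   # hold occurrences; count not yet known
        elif kind == "end_array":
            break                    # array complete, count can be computed
    count = len(buffered)            # plays the role of outputValueCalc
    # The count is written first, so it is known before any item is output.
    return separator.join([str(count)] + buffered)
```

For example, two "item" events followed by "end_array" produce the count 
2 ahead of the two values, and an immediately ended array produces "0".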

Here's a real example of such an expression (albeit with lengthKind 
'explicit' but the principle is the same): 

        dfdl:length="{xs:nonNegativeInteger(fn:floor((../Length + 1) div 2))}" 
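
To make the arithmetic concrete, here is the same calculation in Python. 
What the Length element counts is not stated in this thread, so the 
function name and sample values are purely illustrative; the only point is 
that 'div' is a non-integer division and the floor is applied afterwards.

```python
import math


def evaluated_length(length_field):
    """Mirror fn:floor((../Length + 1) div 2) for a non-negative integer."""
    return math.floor((length_field + 1) / 2)


# Sample evaluations: odd values round up to the next half, even ones halve.
samples = {n: evaluated_length(n) for n in (0, 1, 4, 5)}
# samples == {0: 0, 1: 1, 4: 2, 5: 3}
```

An unparser checking a manually set Length only has to evaluate this in 
the forward direction and compare the result with the data it holds.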

Alex brought up the case where the expression evaluates to 0. In a 
positional sequence, would you still expect a delimiter for this case?   
<tk>Yes, unless it is in the trailing optional region of the group and 
SSP='trailingEmpty'. In a positional sequence, every delimiter must be 
present until suppression begins ( if allowed )</tk> 

If 'yes' then the resultant zero length string must be treated as the 
'absent representation' and ignored. If 'no' then is the sequence still 
positional? 
<tk>I don't understand the point. Why would it not be the 'empty 
representation'? Why must it be 'ignored' if it does happen to be the 
'absent representation'? What does 'ignored' mean?</tk>
<smh>The point is that the parser has been told there are 0 occurrences. 
So it would be odd if the infoset ended up containing an occurrence, which 
can happen if the normal nil/empty rules are followed. (Eg, 
nilValue=%ES;). 
<tk>
If the occursCount expression evaluates to zero then the parser will not 
attempt to parse even one occurrence of the array. That's the natural 
meaning of 'zero occurrences'. So nothing would go into the info set apart 
from the 'count' field. This is entirely consistent with my definition b) 
of 'positional' ( the identity of every delimited field is known before 
parsing of the field begins ).
</tk>
Hence the 0 occurrence case must treat it as absent which means nothing is 
added to the infoset. Take the ISO8583 bitmap use case - if the bit is 0 
we must not try to parse anything at all for that element - it is totally 
absent.</smh>
<tk>Yes - that's exactly my point. The fact that there is ( or could be ) 
a delimiter after the 'count' field is irrelevant.</tk>

Regards

Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 

From:        Tim Kimber/UK/IBM at IBMGB 
To:        dfdl-wg at ogf.org, 
Date:        10/06/2014 21:22 
Subject:        [DFDL-WG] Action 261 
Sent by:        dfdl-wg-bounces at ogf.org 

 Implied separatorSuppressionPolicy for occursCountKind 'expression' (All) 
10/6: Spec says it is 'never' (positional sequence) but you have to parse 
to identify the position, so isn't that non-positional? 

I think there are two alternative definitions of 'positional': 
a) the identity of every delimited field is known before parsing of the 
sequence group begins 
b) the identity of every delimited field is known before parsing of the 
field begins 

As an implementer, b) is sufficient because it means that the parser never 
needs to backtrack while parsing the group. 
a) allows the field identities to be statically known, but that is less 
important - it does not allow optimised extraction of a particular field 
as would be the case for a fixed-length group ( the possibility of escaped 
separators/terminators means that every character will need to be scanned 
anyway ). 

It may sound like a small point, but it affects two decisions 
1. whether ock='expression' should be allowed within a positional sequence 
group ( action 261 ) 
2. what the behaviour of the unparser should be w.r.t. ock='expression'. 

My own feeling is that ock='expression' should be treated almost exactly 
like ock='fixed', except that the calculated array length N is used 
instead of maxOccurs. 
- When parsing a positional sequence group it should cause N delimiters to 
be expected for the array. 
- When unparsing a positional sequence group it should cause N delimiters 
to be written. 
These rules are consistent and straightforward to describe and implement. 
The current rule ( unparser outputs the occurrences that are in the info 
set only ) allows the unparser to write a document that cannot be parsed 
using the same schema. 
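
A small sketch of the symmetric rule, assuming a comma-separated group 
whose first field is the count. The Python below is only an illustration 
under those assumptions: the function names are hypothetical, escaping of 
separators is ignored, and real DFDL processing is considerably richer.

```python
def parse_counted_array(data, separator=","):
    """Parse a count field, then exactly that many occurrences."""
    fields = data.split(separator)
    count = int(fields[0])            # the count field is parsed first
    items = fields[1:1 + count]       # then N occurrences, no backtracking
    if len(items) != count:
        raise ValueError(f"expected {count} occurrences, found {len(items)}")
    return count, items


def unparse_counted_array(count, items, separator=","):
    """Write the count and N occurrences; reject a count that does not match."""
    if count != len(items):
        raise ValueError(f"count is {count} but there are {len(items)} items")
    return separator.join([str(count)] + list(items))
```

Because the unparser refuses a mismatched count, anything it writes can be 
read back by parse_counted_array; a count of 0 simply yields no 
occurrences.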

regards,

Tim Kimber, 

----- Forwarded by Tim Kimber/UK/IBM on 10/06/2014 20:34 ----- 

From:        Steve Hanson/UK/IBM at IBMGB 
To:        dfdl-wg at ogf.org, 
Date:        10/06/2014 17:57 
Subject:        [DFDL-WG] OGF DFDL WG Call Minutes 2014-06-10 
Sent by:        dfdl-wg-bounces at ogf.org 

Please find minutes from the above call at 
http://redmine.ogf.org/dmsf_files/13263?download= 

Regards

Steve Hanson
Architect, IBM DFDL,
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 
--
dfdl-wg mailing list
dfdl-wg at ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU