[DFDL-WG] Fw: Action 261

Fri Aug 29 17:20:33 EDT 2014

I also reviewed this.

I like the symmetry improvement: unparsing with occursCountKind 'expression'
uses the 'never' suppression policy and takes the occurs count from the
number of items in the augmented infoset.

I don't like evaluating the occurs expression on unparsing as a check, or
evaluating the length expression on unparsing as a check. Omitting these
checks does allow unparsing of data that cannot be parsed again, but there
are many, many such holes, and we can't plug them all.

We don't evaluate assert/discriminator tests when unparsing, so why these
other checks?

(Note: I just did a quick search through the spec for all uses of "assert",
and didn't find a statement that says they are only evaluated when parsing.)

If we want to give the option for unparsing to evaluate occurs and length
expressions (and asserts/discriminators too?) that's worth consideration, but I'd
prefer that this be an implementation-specific flag/mode and not part of
the standard (for now.)

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>

On Thu, Aug 28, 2014 at 10:13 AM, Steve Hanson <smh at uk.ibm.com> wrote:

> Please review for Tuesday's WG call...
>
> Regards
>
> Steve Hanson
> Architect, *IBM DFDL*
> <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
> Co-Chair, *OGF DFDL Working Group* <http://www.ogf.org/dfdl/>
> IBM SWG, Hursley, UK
> *smh at uk.ibm.com* <smh at uk.ibm.com>
> tel:+44-1962-815848
> ----- Forwarded by Steve Hanson/UK/IBM on 28/08/2014 15:01 -----
>
> From:        Steve Hanson/UK/IBM
> To:        dfdl-wg at ogf.org,
> Date:        06/08/2014 13:50
> Subject:        Fw: [DFDL-WG] Action 261
> ------------------------------
>
>
> In the spec as it stands, occursCountKind 'expression' has behaviour like
> occursCountKind 'fixed' when parsing (i.e., there is a count that is fixed
> before the occurrences are parsed), but is also stated to have behaviour
> like occursCountKind 'parsed' when unparsing (i.e., there is no count that
> is fixed before occurrences are unparsed).  In terms of expected separators,
> it behaves like 'never' when parsing, but like 'anyEmpty' when unparsing.
> This leads to undesirable asymmetric behaviour, such as this example.
>
> Data stream is 5,A,B,,D,E. To preserve indexes in the infoset after
> parsing, I make the array element nillable with ES as nil value. But on
> unparsing I will get 5,A,B,D,E because the 'anyEmpty' behaviour has
> suppressed the output of the separator for the zero-length nil value. Not
> only will modellers find this behaviour odd, it breaks dfdl:outputValueCalc
> on the count element if fn:count() is used. (Same happens if not nillable
> but minOccurs > 2 so empty string ends up in the infoset).
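
The asymmetry in the example above can be sketched in a few lines of Python. This is only an illustration, not DFDL processor code: `unparse_array` and its `policy` argument are hypothetical names, the separator is fixed as a comma, and an empty string stands in for the zero-length nil representation.

```python
def unparse_array(count_value, items, policy):
    """Join a count field and array items with ',' under a simplified policy.

    policy 'never'    -> every occurrence is written with its separator
    policy 'anyEmpty' -> zero-length occurrences are dropped with their separator
    """
    if policy == "anyEmpty":
        items = [item for item in items if item != ""]
    elif policy != "never":
        raise ValueError(f"unsupported policy: {policy}")
    return ",".join([str(count_value)] + items)


# Infoset after parsing 5,A,B,,D,E: the third occurrence is nil (zero length).
infoset_items = ["A", "B", "", "D", "E"]

suppressed = unparse_array(len(infoset_items), infoset_items, "anyEmpty")
positional = unparse_array(len(infoset_items), infoset_items, "never")
print(suppressed)  # 5,A,B,D,E   (count says 5, only 4 occurrences follow)
print(positional)  # 5,A,B,,D,E  (round-trips to the original data stream)
```

Under the simplified 'anyEmpty' rule the count field still says 5 while only four occurrences are written, which is the mismatch that breaks re-parsing.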
>
> The proposal is that 'expression' behaves like 'never' on unparsing, the
> count being taken as *the number of items in the augmented infoset*. That
> way the application or outputValueCalc can be certain that no separators
> will be suppressed, and outputValueCalc will work.
>
> It was proposed further down this email thread that the expression is
> evaluated at the start of unparsing the element and used to obtain the
> count. There are scenarios where that causes problems though, hence the
> above is preferred. However, to ensure that the output data stream can be
> re-parsed, it is proposed that the expression is evaluated at the *end*
> of the unparsing of the element (and any elements with expressions
> dependent on it), and if it fails to match the number that was output, it
> is a processing error.
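
A rough sketch of the proposed end-of-unparse check, under stated assumptions: the infoset is modelled as a plain dict, `count_expression` is a hypothetical stand-in for an occursCount expression such as `{ ../count }`, and `ProcessingError` simply represents a DFDL processing error.

```python
class ProcessingError(Exception):
    """Stands in for a DFDL processing error in this sketch."""


def count_expression(infoset):
    # Hypothetical equivalent of dfdl:occursCount="{ ../count }"
    return infoset["count"]


def unparse_array_then_check(infoset, expression):
    """Write the count and every occurrence, then verify the count expression."""
    items = infoset["item"]
    # 'never' behaviour: the number written is the number of items in the
    # infoset, and no separator is suppressed for zero-length occurrences.
    output = ",".join([str(infoset["count"])] + items)
    # The expression is evaluated only after the occurrences have been written.
    expected = expression(infoset)
    if expected != len(items):
        raise ProcessingError(
            f"count expression gave {expected}, "
            f"but {len(items)} occurrences were written"
        )
    return output


good = unparse_array_then_check(
    {"count": 5, "item": ["A", "B", "", "D", "E"]}, count_expression
)
print(good)  # 5,A,B,,D,E
```

An infoset whose count element is set to 4 with the same five items would raise `ProcessingError` here instead of producing a data stream that cannot be re-parsed.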
>
> Similarly for lengthKind 'explicit' where length is an expression. The
> element continues to be considered variable length on unparsing as the spec
> says today, but the expression is evaluated at the end of the unparsing of
> the element, and if it fails to match the length that was output, it is a
> processing error.
>
> It would be great to do the analogous thing for lengthKind 'pattern' but
> it is not always possible as some patterns look ahead beyond the element
> content in order to match.  So it could be done for some patterns but not
> all. Implementation-dependent?
>
> Regards
>
> Steve Hanson
> ----- Forwarded by Steve Hanson/UK/IBM on 06/08/2014 13:08 -----
> From:        Steve Hanson/UK/IBM
> To:        dfdl-wg at ogf.org,
> Date:        25/06/2014 11:33
> Subject:        Fw: [DFDL-WG] Action 261
> ------------------------------
>
>
>   *261*
> *Implied separatorSuppressionPolicy for occursCountKind 'expression' (All)*
> 10/6: Spec says it is 'never' (positional sequence) but you have to parse
> to identify the position, so isn't that non-positional?
> 17/6: Some other issues noted around 'expression' as per email thread. IBM
> have discussed this internally and will submit a proposal.
>
> As was noted in the email for Action 260, if it is decided that the
> meaning of "*Each occurrence in the sequence can be identified by its
> position in the data*" is more strictly that *an observer of the raw data
> can identify an occurrence of an element in the sequence solely by counting
> separators *then that would appear to make dfdl:occursCountKind
> 'expression' more like 'parsed', and not eligible to be in a Positional
> sequence. But if the meaning is *a parser does not have to speculate to
> identify an occurrence of an element in the sequence* then it can be in a
> Positional sequence.
>
> While discussing the nature of 'expression' it was noted that it is very
> easy for a DFDL user to create a data stream from an infoset and for that
> data stream to be un-parse-able. If dfdl:outputValueCalc is not used, then
> the element(s) in the infoset must be correctly set manually to match the
> number of occurrences. The same observation applies to dfdl:lengthKind
> 'explicit' where dfdl:length is an expression.
>
> To address this, IBM proposes the following changes to the DFDL
> specification for occursCountKind 'expression' when unparsing:
>
>    - When all occurrences have been obtained from the infoset (and
>    defaulting applied if needed), the occursCount expression is evaluated
>    - If any element that is referenced by the expression has
>    dfdl:outputValueCalc, then that expression is (re-)evaluated as part of the
>    above; given that the number of occurrences is now known, the
>    outputValueCalc expression should now evaluate successfully
>    - If the result of the occursCount does not match the number of
>    occurrences it is a processing error
>
> This ensures the integrity of the data stream (the non-outputValueCalc use
> case), while still allowing the infoset to dictate the count (the
> outputValueCalc use case).
>
> Similarly, IBM proposes the following changes to the DFDL specification
> for lengthKind 'explicit', where length is an expression, when unparsing:
>
>    - When an element has been obtained from the infoset (and defaulting
>    applied if needed) and unparsed but before padding is applied, the
>    expression is evaluated to give a length.
>    - If any element that is referenced by the expression has
>    dfdl:outputValueCalc, then that expression is (re-)evaluated; given that
>    the unpadded length is now known, the outputValueCalc expression should now
>    evaluate successfully
>    - Now that we have a length, the unparser behaviour is the same as if
>    the length was obtained from a fixed dfdl:length value. (That is,
>    truncation, padding, filling or error depending on other property
>    settings).
>
> This ensures the integrity of the data stream (the non-outputValueCalc use
> case), while still allowing the infoset to dictate the length (the
> outputValueCalc use case). Note that it means the 'awkward' behaviour
> whereby lengthKind 'explicit' (expression) is a specified length when
> parsing but variable length when unparsing is avoided - it is now always
> specified length.
>
> IBM believes that none of the above should impose additional burden on
> implementers, as no brand new behaviour is being added.
>
> Regards
>
> Steve Hanson
> ----- Forwarded by Steve Hanson/UK/IBM on 25/06/2014 10:24 -----
>
>
>
>
> From:        Tim Kimber/UK/IBM
> To:        Steve Hanson/UK/IBM at IBMGB, Alex Wood1/UK/IBM at IBMGB, Andrew
> Edwards/UK/IBM at IBMGB, Mark Frost/UK/IBM at IBMGB,
> Date:        16/06/2014 22:08
> Subject:        Re: [DFDL-WG] Action 261
> ------------------------------
>
>
> I think this needs to be discussed before the meeting tomorrow. I wanted
> to avoid turning this thread into a monster - but I don't think it's
> possible to discuss without the context so I've continued the thread as
> before with <tk> tags.
>
> I'm clear in my own mind that I understand the issues now.  We should give
> priority to the separatorSuppressionPolicy question because the current
> rules are close to being unimplementable. The occursCountKind and
> lengthKind questions are important but are at least implementable as they
> stand.
>
> regards,
>
> Tim Kimber,
> IBM Integration Bus Development (Industry Packs)
> Hursley, UK
> Internet:  kimbert at uk.ibm.com
> Tel. 01962-816742
> Internal tel. 37246742
>
>
>
>
>
> From:        Steve Hanson/UK/IBM
> To:        Tim Kimber/UK/IBM at IBMGB,
> Cc:        dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org
> Date:        11/06/2014 16:12
> Subject:        Re: [DFDL-WG] Action 261
> ------------------------------
>
>
> Replies in <smh> tags
>
> Regards
>
> Steve Hanson
>
>
>
>
> From:        Tim Kimber/UK/IBM at IBMGB
> To:        dfdl-wg at ogf.org,
> Date:        11/06/2014 13:58
> Subject:        Re: [DFDL-WG] Action 261
> Sent by:        dfdl-wg-bounces at ogf.org
> ------------------------------
>
>
>
> comments in <tk>tags
> regards,
>
> Tim Kimber,
>
>
>
>
> From:        Steve Hanson/UK/IBM
> To:        Tim Kimber/UK/IBM at IBMGB,
> Cc:        dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org
> Date:        11/06/2014 10:47
> Subject:        Re: [DFDL-WG] Action 261
>  ------------------------------
>
>
> Some thoughts on this...
>
> I agree that the definition of positional sequence in the spec needs
> tightening as it is ambiguous as it stands and could be interpreted as a)
> or b).  If we adopted b) then that would appear to allow 'expression' to
> appear in a positional sequence, but wouldn't it also allow 'stopValue'?
> <tk>Yes - according to definition b) stopValue would be allowable in a
> positional sequence. We could still disallow it if we do not believe there
> is any benefit in allowing it. I don't believe it introduces any particular
> complexities for an implementer.</tk>
>
> occursCountKind 'expression' is analogous to lengthKind 'explicit' with an
> expression and to lengthKind 'prefixed'. Both these lengthKinds are
> classified as 'specified length' when parsing but 'variable length' when
> unparsing. We are observing that occursCountKind 'expression' is like
> 'fixed' when parsing but not quite so like 'fixed' when unparsing - which
> is why section 16 groups 'expression' with 'parsed' for unparsing.
> <tk>Yes - we took a decision that the unparser should ignore the
> expression in lengthKind/occursCountKind, and just output whatever data
> happens to be in the info set. I'm not sure that it saves a lot of effort
> in the implementation and it certainly is not easy to justify as a
> consistent behaviour. For me, the unparser should treat
> lengthKind='explicit' the same way whether the value is static or
> calculated. And the unparser should treat occursCountKind='expression'
> the same way as occursCountKind='fixed'. </tk>
>
> When unparsing occursCountKind 'expression' you don't always have the
> calculated array length N. If the infoset was derived from XML, there is
> likely no 'count' element, just a bunch of elements with the same name that
> make up the 'array'. DFDL gives you the choice whether to a) manually set
> the count element, or b) have the unparser set it automatically via
> outputValueCalc. In the former case, you can create a document that can not
> be parsed; the unparser could check the 'count' element matches the
> infoset, but that would involve reverse engineering an arbitrarily complex
> expression and is why the specification does not say that.
> <tk>It would involve evaluating the expression. In most cases, that will
> not require any lookahead because the Length/Count field will precede the
> array or element. Not sure where the reverse engineering comes in?</tk>
> <smh>I see what you are saying. Just evaluate the expression and see what
> it gives for N. That handles case a) but not b) where I explicitly want the
> unparser to set the count via outputValueCalc - which is presumably
> referring to the number of elements in the array, which is not known. For
> case b) N has to be the number in the infoset. Given that we have to
> support case b) the unparser can not treat occursCountKind 'expression'
> exactly the same as 'fixed' when unparsing.</smh>
>
> <smh>Similarly with lengthKind 'explicit' with an expression. For the
> equivalent to case a) the length is known which makes the length fixed, but
> for the equivalent of case b) with outputValueCalc the length is not known
> so it is variable. When this was discussed in the past, it was decided not
> to bifurcate the expression scenario. Hence the spec is the way it is.
> </smh>
> <tk>
> That helps. So your belief is that case a) is workable but case b) is not
> because the number of elements in the array is not known. I don't think
> that holds up under scrutiny.
> In case b), the outputValueCalc expression cannot be evaluated until all
> of the array has been received. So if the info set is received as an event
> stream then the unparser must wait until the array completes before
> evaluating the expression. Note that the 'count' ( or 'length' ) field
> cannot be serialized until its value is known. But the array ( or
> variable-length field ) comes *after* the count/length field. By the time
> unparsing of the array/field begins, the value *is* known.
>
> The implementation of case a) is actually less straightforward. If the
> field is an array length and the value is less than the number of items in
> the info set then I think the unparser must issue an error. If greater then
> the unparser could output default values ( if available ) or delimiters (
> if the parent sequence is a positional sequence with a delimiter ) or else
> an error if neither are possible. More simply, the unparser could simply
> insist that the value must correctly describe the data, and I think that's
> a reasonable rule.
> Similarly for the length. If the length of the *unparsed* value is greater
> than the value in the info set then the unparser should issue an error. If
> it is shorter and the field is simple then pad characters could be added.
> But again, I think real-world usage of length counts suggests that padding
> is unlikely to be wanted, and the unparser should simply insist on the
> length field being correct.
> </tk>
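
The ordering argument for case b) above can be illustrated with a small sketch. It assumes the infoset arrives as a stream of hypothetical `("item", value)` events ended by `("end_array", None)`, and that the count element's outputValueCalc is equivalent to counting the array items; none of these names come from the DFDL specification.

```python
def unparse_event_stream(events):
    """Buffer the array, compute the count, then serialize count before array."""
    buffered = []
    for kind, value in events:
        if kind == "item":
            buffered.append(value)
        elif kind == "end_array":
            break
    # Plays the role of an outputValueCalc that counts the array items;
    # it can only be evaluated once the whole array has been received.
    count = len(buffered)
    # The count field precedes the array in the data, so by the time the
    # array itself is written the count is already known.
    return ",".join([str(count)] + buffered)


events = [("item", "A"), ("item", "B"), ("item", "C"), ("end_array", None)]
serialized = unparse_event_stream(events)
print(serialized)  # 3,A,B,C
```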
>
> Here's a real example of such an expression (albeit with lengthKind
> 'explicit' but the principle is the same):
>
>        dfdl:length="{xs:nonNegativeInteger(fn:floor((../Length + 1) div
> 2))}"
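
For reference, the arithmetic of that expression can be checked with a short Python sketch. It assumes `../Length` holds an integer; `explicit_length` is a hypothetical helper name. The shape of the formula would suit a Length that counts half-byte units, but that reading is an assumption and not stated in the thread.

```python
def explicit_length(length_field: int) -> int:
    """Mirror xs:nonNegativeInteger(fn:floor((../Length + 1) div 2))."""
    # Python's floor division matches fn:floor applied to the quotient.
    result = (length_field + 1) // 2
    if result < 0:
        # The xs:nonNegativeInteger cast would fail for a negative result.
        raise ValueError(f"negative length {result} for Length={length_field}")
    return result


for length_field in (0, 1, 2, 5, 6):
    print(length_field, explicit_length(length_field))
# 0 0
# 1 1
# 2 1
# 5 3
# 6 3
```

Several values of Length map to the same result, so the expression cannot in general be inverted to recover Length from the unparsed length alone.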
>
> Alex brought up the case where the expression evaluates to 0. In a
> positional sequence, would you still expect a delimiter for this case?
> <tk>Yes, unless it is in the trailing optional region of the group and
> SSP='trailingEmpty'. In a positional sequence, every delimiter must be
> present until suppression begins ( if allowed )</tk>
>
> If 'yes' then the resultant zero length string must be treated as the
> 'absent representation' and ignored. If 'no' then is the sequence still
> positional?
> <tk>I don't understand the point. Why would it not be the 'empty
> representation'? Why must it be 'ignored' if it does happen to be the
> 'absent representation'? What does 'ignored' mean?</tk>
> <smh>The point is that the parser has been told there are 0 occurrences.
> So it would be odd if the infoset ended up containing an occurrence, which
> can happen if the normal nil/empty rules are followed. (Eg, nilValue=%ES;).
> <tk>
> If the occursCount expression evaluates to zero then the parser will not
> attempt to parse even one occurrence of the array. That's the natural
> meaning of 'zero occurrences'. So nothing would go into the info set apart
> from the 'count' field. This is entirely consistent with my definition b)
> of 'positional' ( *the identity of every delimited field is known before
> parsing of the field begins* ).
> </tk>
> Hence the 0 occurrence case must treat it as absent which means nothing is
> added to the infoset. Take the ISO8583 bitmap use case - if the bit is 0
> we must not try to parse anything at all for that element - it is totally
> absent.</smh>
> <tk>Yes - that's exactly my point. The fact that there is ( or could be )
> a delimiter after the 'count' field is irrelevant.</tk>
>
> Regards
>
> Steve Hanson
>
>
>
>
> From:        Tim Kimber/UK/IBM at IBMGB
> To:        dfdl-wg at ogf.org,
> Date:        10/06/2014 21:22
> Subject:        [DFDL-WG] Action 261
> Sent by:        dfdl-wg-bounces at ogf.org
>  ------------------------------
>
>
>
> * Implied separatorSuppressionPolicy for occursCountKind 'expression'
> (All)*
> * 10/6: Spec says it is 'never' (positional sequence) but you have to
> parse to identify the position, so isn't that non-positional?*
>
> I think there are two alternative definitions of 'positional':
> a) the identity of every delimited field is known before parsing of the
> sequence group begins
> b) the identity of every delimited field is known before parsing of the
> field begins
>
> As an implementer, b) is sufficient because it means that the parser never
> needs to backtrack while parsing the group.
> a) allows the field identities to be statically known, but that is less
> important - it does not allow optimised extraction of a particular field as
> would be the case for a fixed-length group ( the possibility of escaped
> separators/terminators means that every character will need to be scanned
> anyway ).
>
> It may sound like a small point, but it affects two decisions
> 1. whether ock='expression' should be allowed within a positional sequence
> group ( action 261 )
> 2. what the behaviour of the unparser should be w.r.t. ock='expression'.
>
> My own feeling is that ock='expression' should be treated almost exactly
> like ock='fixed', except that the calculated array length N is used instead
> of maxOccurs.
> - When parsing a positional sequence group it should cause N delimiters to
> be expected for the array.
> - When unparsing a positional sequence group it should cause N delimiters
> to be written.
> These rules are consistent and straightforward to describe and implement.
> The current rule ( unparser outputs the occurrences that are in the info
> set only ) allows the unparser to write a document that cannot be parsed
> using the same schema.
>
> regards,
>
> Tim Kimber,
>
> ----- Forwarded by Tim Kimber/UK/IBM on 10/06/2014 20:34 -----
>
> From:        Steve Hanson/UK/IBM at IBMGB
> To:        dfdl-wg at ogf.org,
> Date:        10/06/2014 17:57
> Subject:        [DFDL-WG] OGF DFDL WG Call Minutes 2014-06-10
> Sent by:        dfdl-wg-bounces at ogf.org
>  ------------------------------
>
>
>
> Please find minutes from the above call at
> *http://redmine.ogf.org/dmsf_files/13263?download=*
> <http://redmine.ogf.org/dmsf_files/13263?download=>
>
> Regards
>
> Steve Hanson
> Architect, IBM DFDL,
> Co-Chair, *OGF DFDL Working Group* <http://www.ogf.org/dfdl/>
> IBM SWG, Hursley, UK
> *smh at uk.ibm.com* <smh at uk.ibm.com>
> tel:+44-1962-815848
> --
> dfdl-wg mailing list
> dfdl-wg at ogf.org
> *https://www.ogf.org/mailman/listinfo/dfdl-wg*
> <https://www.ogf.org/mailman/listinfo/dfdl-wg>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU