[DFDL-WG] proposed wording: scanable and 'results are not predictable' improvement - was Fwd: issue: scannable and 'results are not predictable'

Wed Jul 24 10:38:24 EDT 2013

Typo fixed below (dot was not edited out).

On Tue, Jul 23, 2013 at 7:11 PM, Mike Beckerle <mbeckerle.dfdl at gmail.com> wrote:

>
> The upshot of this whole thread is that we need to fix the description of
> testPattern for asserts/discriminators. That motivates another correction
> in the prose description of encodingErrorPolicy.
>
> Proposed rewording:
>
> This paragraph in the testPattern description for asserts/discriminators:
>
>
>    - In order for a testPattern to be used, the data subject to the
>    pattern must be scannable using a DFDL regular expression otherwise the
>    results are not predictable.
>
> Change to:
>
>    - In order for a testPattern to be used, the data subject to the
>    pattern must be scannable using a DFDL regular expression. If the pattern
>    regular expression reads data that cannot be decoded into characters of the
>    current encoding, then the behavior is controlled by the
>    dfdl:encodingErrorPolicy property. See Section 11.2.1 Property
>    dfdl:encodingErrorPolicy for details.
>
> In addition, consider this paragraph in section 11.2.1.2:
>
>    - The Unicode Replacement Character must not appear in any delimiter,
>    pad character, nil value, regular expression, number pattern or calendar
>    pattern, or in any other DFDL property value where the Unicode Replacement
>    Character would be expected in the data being parsed. It is a schema
>    definition error if the Unicode Replacement Character appears in any of
>    these locations of a DFDL schema, or is part of the value of an expression
>    that returns a string to be used as the value of a DFDL property.
>
> I believe the above paragraph is a mistake. It precludes a very useful
> technique, which is to use a negated character class in a regex, such as
> [^\uFFFD]. This regex matches any character except the Unicode
> replacement character. I suggest the above paragraph
> be dropped.
>
>
> This sentence (same section) can be modified:
>
>
>    - Schema authors are advised that bounded length regular expressions
>    can help in this case. E.g., ".{0,50}" says to match any character
>    (including Unicode Replacement Characters), but only up to length 50.
>
> Change to:
>
>    - Schema authors are advised that bounded length regular expressions
>    and negated character classes can improve the schema. E.g.,
>    "[^\uFFFD]{0,50}" says to match any character (*excluding* specifically
>    Unicode Replacement Characters, U+FFFD), but only up to length 50.
>
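To make the difference between the two regexes concrete, here is a minimal sketch in Python. It is only an analogy: Python's re module and its errors="replace" decoding stand in for DFDL regular expressions and a 'replace' encodingErrorPolicy, and the sample bytes are made up.

```python
import re

# Hypothetical input: 0xFF and 0xFE can never occur in well-formed UTF-8.
raw = b"HELLO\xff\xfeWORLD"

# Rough stand-in for encodingErrorPolicy 'replace': undecodable bytes
# show up as U+FFFD in the decoded text.
text = raw.decode("utf-8", errors="replace")

# The original example: any character, replacement characters included.
any_char = re.compile(r".{0,50}", re.DOTALL)
# The proposed example: stops at the first replacement character.
no_replacement = re.compile("[^\uFFFD]{0,50}")

matched_any = any_char.match(text).group()
matched_clean = no_replacement.match(text).group()

print(repr(matched_any))
print(repr(matched_clean))
```

The bounded "any character" form runs straight through the undecodable region, while the negated class ends the match just before it, which is why the negated class is useful for keeping a pattern inside text that decoded cleanly.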
>
> ---------- Forwarded message ----------
> From: Steve Hanson <smh at uk.ibm.com>
> Date: Tue, Jul 23, 2013 at 9:02 AM
> Subject: Re: [DFDL-WG] issue: scannable and 'results are not predictable'
> To: Mike Beckerle <mbeckerle.dfdl at gmail.com>
> Cc: dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org, Tim Kimber <
> KIMBERT at uk.ibm.com>
>
>
> Mike, would you like to attempt some words to improve the 'results not
> predictable' sentence?
>
> Regards
>
> Steve Hanson
> Architect, IBM Data Format Description Language (DFDL)
> Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> IBM SWG, Hursley, UK
> smh at uk.ibm.com
> tel:+44-1962-815848
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        Steve Hanson/UK/IBM at IBMGB,
> Cc:        Tim Kimber/UK/IBM at IBMGB, dfdl-wg at ogf.org,
> dfdl-wg-bounces at ogf.org
> Date:        11/07/2013 18:18
> Subject:        Re: [DFDL-WG] issue: scannable and 'results are not
> predictable'
> ------------------------------
>
>
>
>
> OK, let me summarize. I think this is clear now:
>
> lengthKind 'pattern' requires everything to be statically known to be text.
> Binary data can be faked as iso-8859-1 text, since that encoding accepts
> all byte values.
>
> An assert pattern just tries to decode in the current encoding. Binary data
> might or might not cause decode errors, depending on what is in the actual
> data. Use encoding iso-8859-1 to preclude this possibility; otherwise it is
> user-beware/know-the-data.
>
> The behavior of decode errors is controlled by encodingErrorPolicy, which
> clearly states that if the policy is 'error' then a processing error is
> issued. It specifically states that this applies in all situations
> including lengthKind pattern, and pattern asserts. The description does not
> leave any wiggle room here.
>
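As a rough illustration of the two behaviors, the sketch below uses Python's codec error handlers as a stand-in for encodingErrorPolicy. The helper decode_with_policy and the sample bytes are hypothetical, and a UnicodeDecodeError plays the role of a DFDL processing error.

```python
def decode_with_policy(data: bytes, encoding: str, policy: str) -> str:
    """Hypothetical helper mirroring the two encodingErrorPolicy values."""
    if policy == "error":
        # Python raises UnicodeDecodeError; this stands in for a processing error.
        return data.decode(encoding, errors="strict")
    if policy == "replace":
        # Undecodable bytes become U+FFFD instead of stopping the parse.
        return data.decode(encoding, errors="replace")
    raise ValueError(f"unknown policy: {policy!r}")


data = b"AB\x80CD"  # a lone 0x80 byte is not valid UTF-8

replaced = decode_with_policy(data, "utf-8", "replace")

try:
    decode_with_policy(data, "utf-8", "error")
    failed = False
except UnicodeDecodeError:
    failed = True

# With iso-8859-1 every byte decodes, so the 'error' policy never triggers.
latin1 = decode_with_policy(data, "iso-8859-1", "error")

print(repr(replaced), failed, repr(latin1))
```

The point is that the outcome is fully determined by the policy and the encoding, which is what the reworded testPattern text should say instead of "results are not predictable".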
> (There's a separate email thread on asserts with expressions that get
> errors but it does not yet discuss pattern asserts.)
>
> So the wording that says "results are not predictable" should instead
> explain and provide a reference to the description of encodingErrorPolicy.
>
>
> On Thu, Jul 11, 2013 at 5:57 AM, Steve Hanson <smh at uk.ibm.com> wrote:
> In the original DFDL 1.0 spec this is what we used to say about lengthKind
> 'pattern'.
> 12.3.5.1 Pattern-Based Lengths - Scanability
>
>    Any element (complex, simple text, simple binary) may have a
>    dfdl:lengthKind 'pattern' as long as the bytes in the content region of the
>    element are legal in the stated encoding of that element. Where a
>    complex element has children with binary representation in practice this
>    means an 8-bit ASCII encoding.
>
>    Binary data can be handled by way of treating it as text with
>    encoding='iso-8859-1'. In this case the text is interpreted as in the
>    iso-8859-1 character encoding, and the correspondence of byte values in the
>    data to a string in the DFDL infoset is one to one. That is, byte with
>    value N, produces an infoset character with character code N.
>
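The one-to-one claim is easy to check outside DFDL. A minimal sketch using Python's iso-8859-1 codec (an illustration only, not a DFDL processor):

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# iso-8859-1 decodes each byte to the character with the same code point.
text = all_bytes.decode("iso-8859-1")
code_points = [ord(ch) for ch in text]

# The mapping is reversible, so no information is lost either way.
round_trip = text.encode("iso-8859-1")

print(code_points == list(range(256)), round_trip == all_bytes)
```

Because no byte value is rejected, a regular expression over iso-8859-1 text can never hit a decode error, whatever the underlying binary data contains.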
> This was changed by errata 3.9 back in the 4th revision of the Errata
> document. At the time, the same limit was applied to asserts &
> discriminators as well. Here is the original errata wording.
>
>    3.9. Section 12.3.5.1. The spec currently allows lengthKind ‘pattern’
>    to be used when the representation of the current element, or of a child
>    element, is binary, but imposes restrictions on the encoding that can be in
>    force. However encoding is not necessarily examined for binary elements, so
>    this would introduce another reason for needing encoding.
>
>    Change the spec so that lengthKind ‘pattern’ is only applicable to:
>       o elements of simple type with representation 'text'
>       o elements of complex type
>
>    For an element of complex type:
>       1. all simple child elements must have representation 'text' and
>          have the same encoding as the parent complex element, and
>       2. all complex child elements must themselves follow 1 and 2
>          (recursively).
>
>    Similar wording to apply to dfdl:assert testKind="pattern" in section
>    7.3.1.
>
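The recursive rule in errata 3.9 can be sketched as a small static check. This is only an illustration: the Element class is a hypothetical stand-in for a schema component, "has children" is used as the test for a complex type, and rule 2 is read literally, so each complex child is checked against its own encoding.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical, simplified view of a DFDL element declaration."""
    name: str
    encoding: str
    representation: str = "text"  # 'text' or 'binary'; used for simple elements
    children: List["Element"] = field(default_factory=list)


def pattern_length_allowed(elem: Element) -> bool:
    """Return True if lengthKind 'pattern' is allowed under errata 3.9."""
    if not elem.children:
        # Simple element: must have text representation.
        return elem.representation == "text"
    for child in elem.children:
        if child.children:
            # Rule 2: complex children must satisfy rules 1 and 2 themselves.
            if not pattern_length_allowed(child):
                return False
        elif child.representation != "text" or child.encoding != elem.encoding:
            # Rule 1: simple children must be text in the parent's encoding.
            return False
    return True


record = Element("record", "utf-8", children=[
    Element("id", "utf-8"),
    Element("body", "utf-8", children=[Element("line", "utf-8")]),
])
with_binary = Element("record", "utf-8", children=[
    Element("id", "utf-8"),
    Element("count", "utf-8", representation="binary"),
])

print(pattern_length_allowed(record), pattern_length_allowed(with_binary))
```

A check like this is what makes the lengthKind 'pattern' restriction statically enforceable, in contrast to the assert/discriminator case discussed next.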
> In the 11th revision of the Errata document, the last sentence was changed
> to...
>
>    Note that the same restrictions do not apply to testKind="pattern" on
>    asserts and discriminators
>
> This was done because an assert or discriminator with testKind 'pattern'
> is peeking ahead into the data stream from the start of the representation
> of the object (element / sequence / choice). This was recorded by action
> 190.
>
>    190: Clarify rules for assert/discriminator testKind 'pattern' (All)
>    23/10: Need to be clear on data position and whether it is just for text
>    representations.
>    30/10: Closed. To comply with the timing rules being proposed in action
>    186, where these things are executed first before a 'format' annotation,
>    the data position must be the beginning of the representation (note warning
>    useful when alignment present). As these things can be used on various
>    objects, the only rule regarding text is that dfdl:encoding must have a
>    value in scope. Errata taken.
>
>
> Personally I am happy for DFDL 1.0 to stick with the current errata, and
> improve the wording in the testPattern description.
>
> Regards
>
> Steve Hanson
> Architect, IBM Data Format Description Language (DFDL)
> Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> IBM SWG, Hursley, UK
> smh at uk.ibm.com
> tel:+44-1962-815848
>
>
>
> From:        Tim Kimber/UK/IBM at IBMGB
> To:        dfdl-wg at ogf.org,
> Date:        11/07/2013 10:08
> Subject:        Re: [DFDL-WG] issue: scannable and 'results are not
> predictable'
> Sent by:        dfdl-wg-bounces at ogf.org
> ------------------------------
>
>
>
> There was a time when we disallowed lengthKind='delimited' when
> representation is 'binary'. Binary data can, in general, contain any
> sequence of bytes so it might contain the terminating markup. In other
> words it is not guaranteed to be 'scannable'. We relaxed that rule because
> we found that there are industry formats out there which contain non-text
> delimited fields. In other words, the general rule ( binary data is not
> scannable ) does not always apply in specific formats.
>
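A small sketch of why binary data is not guaranteed to be scannable, using a made-up format with a 4-byte big-endian integer followed by a comma delimiter. The field layout and the helper scan_first_field are assumptions for illustration only.

```python
import struct

DELIMITER = b","  # 0x2C


def scan_first_field(payload: bytes) -> bytes:
    """Hypothetical scanner: the first field ends at the first delimiter byte."""
    return payload[:payload.find(DELIMITER)]


# 45 packs to 00 00 00 2D, which contains no delimiter byte.
safe = struct.pack(">I", 45) + DELIMITER + b"next"
# 44 packs to 00 00 00 2C, whose last byte is the delimiter itself.
unsafe = struct.pack(">I", 44) + DELIMITER + b"next"

safe_field = scan_first_field(safe)
unsafe_field = scan_first_field(unsafe)

print(len(safe_field), len(unsafe_field))
```

Whether scanning works depends on the values that actually occur in the data, which is the distinction between "not guaranteed to be scannable" and "not scannable" made in the next paragraph.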
> I think that point is relevant to this discussion. Just because the DFDL
> properties indicate that the data is not *guaranteed* to be scannable, that
> does not mean that the actual data is not scannable. I believe we should
> - define the term 'scannable'
> - acknowledge that when a complex type is not 'scannable' according to the
> definition, the data still might be parse-able in a reliable way
> - not prohibit the use of lengthKind='pattern' ( i.e. not issue an SDE )
> just because the element is not 'scannable'.
>
> It may well be appropriate for an implementation to issue a warning when
> lengthKind is 'delimited' or 'pattern' and the element's content is not
> 'scannable'.
>
> regards,
>
> Tim Kimber, DFDL Team,
> Hursley, UK
> Internet:  kimbert at uk.ibm.com
> Tel. 01962-816742
> Internal tel. 37246742
>
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        dfdl-wg at ogf.org,
> Date:        10/07/2013 18:22
> Subject:        [DFDL-WG] issue: scannable and 'results are not
> predictable'
> Sent by:        dfdl-wg-bounces at ogf.org
>  ------------------------------
>
>
>
> I was editing the definition of scannable into the glossary and when I
> looked at usage of 'scannable' in testPattern I found this:
>
> In the box for testPattern it says if the data is not scannable "the
> results are not predictable".
>
> Is that sufficient?
>
> We can sometimes statically determine that the schema says the data should
> all be scannable (e.g., no change of encoding, no binary elements), and
> that would rule out one non-predictability. So, if data is non-scannable
> in the sense that the schema contains, say, a binary element, we can issue
> an SDE if lengthKind is 'pattern' or a testPattern assert is being used.
>
> We could also SDE if runtime-valued encoding properties are used and the
> encoding changes inside a scannable context.
>
> Well, I guess testKind pattern asserts/discriminators are an issue because
> they may look only at the first part of the data of a complex component, so
> they don't require everything to be scannable, only the part the regex
> actually examines. So in this case it's user-beware, and if non-scannable I
> suppose we could issue a warning.
>
> But the spec does not say this is an SDE or warning currently. It just
> says results are not predictable.
>
> There is also the fact that the data might be broken, i.e., the schema
> might say the data is scannable, but at parse time character decode errors
> occur.  I believe our policy on this is that these cause processing errors.
> This really is orthogonal to scannable, which is a property of a schema
> component.
>
> Comments?
>
> --
> Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
> www.tresys.com
> --
> dfdl-wg mailing list
> dfdl-wg at ogf.org
> https://www.ogf.org/mailman/listinfo/dfdl-wg
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com