[DFDL-WG] clarification on behavior of DFDL encodingErrorPolicy='error' and pre-decoding by implementations

Thu Dec 6 11:43:09 EST 2018

Per discussion on the WG call of 2018-12-06, phrasing to this effect is to
be added to section 11.2.1.1:

Implementations may pre-decode a limited number of characters for
efficiency; however, such implementation-dependent pre-decoding can cause a
parse error to be detected in some implementations of DFDL that is not
detected by others.

Schema authors are advised not to rely on decoding errors for backtracking
to control the behavior of the parser.
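
To make the divergence concrete, here is a minimal sketch in Python. It is
not taken from any DFDL implementation; the data, the function names, and the
64-byte prefetch window are all assumptions for illustration (the window is
counted in bytes here rather than decoded characters, to keep it short).

```python
# Hypothetical data: 8 ASCII characters of text followed by binary bytes.
# 0xFF can never appear in well-formed UTF-8, so decoding it strictly fails.
data = b"ABCDEFGH" + bytes([0xFF, 0xFE, 0x00, 0x01])

TEXT_LENGTH = 8  # length the schema would determine for the text field


def parse_exact(buf: bytes) -> str:
    """Decode only the bytes that belong to the text field."""
    return buf[:TEXT_LENGTH].decode("utf-8", errors="strict")


def parse_with_eager_predecode(buf: bytes, prefetch: int = 64) -> str:
    """Eagerly decode up to `prefetch` bytes, then take the text field.

    With strict error handling this raises if the prefetch window strays
    into the binary data, even though the text field itself is valid.
    """
    decoded = buf[:prefetch].decode("utf-8", errors="strict")
    return decoded[:TEXT_LENGTH]


exact_result = parse_exact(data)

try:
    parse_with_eager_predecode(data)
    eager_failed = False
except UnicodeDecodeError:
    eager_failed = True
```

The same input parses cleanly on one path and fails on the other, which is
the kind of difference a schema should not depend on.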

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>

On Tue, Oct 9, 2018 at 6:04 AM Steve Hanson <smh at uk.ibm.com> wrote:

> This can be summarised by saying that a performance optimisation by a DFDL
> implementation should not change a successful parse into a failure (and
> vice-versa) nor should it change the DFDL infoset if the parse is
> successful. I think that goes without saying but we could be explicit and
> add it somewhere.
>
> If an implementation is pre-decoding when parsing, then it needs to be
> sure that whatever it tries to decode a) does not go beyond the end of the
> data (possible if streaming input), and b) is legitimately in that
> encoding. If an implementation does some analysis of the schema, and
> realises that the data will always be entirely UTF-8 text, then
> pre-decoding is a possible optimisation. If the data is a mixture of text
> and binary then pre-decoding would not be a possible optimisation, unless
> there was also a fallback that the code dropped into after a decode error
> where it did not pre-decode.
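
A fallback of the kind described above might look roughly like the
following. This is only a sketch in Python under stated assumptions, not how
any DFDL implementation works: read_text, decode_exactly, and the 64-byte
prefetch window are hypothetical names and values chosen for illustration.

```python
import codecs


def decode_exactly(buf: bytes, n_chars: int) -> str:
    """Slow path: decode one byte at a time and stop at n_chars characters.

    A decode error is raised only if the offending byte is reached before
    n_chars characters have been produced.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    chars = ""
    for byte in buf:
        chars += decoder.decode(bytes([byte]))
        if len(chars) >= n_chars:
            return chars
    raise ValueError("insufficient data for the requested text length")


def read_text(buf: bytes, n_chars: int, prefetch: int = 64) -> str:
    """Fast path pre-decodes a window; any decode error triggers the slow path."""
    try:
        decoded = buf[:prefetch].decode("utf-8", errors="strict")
        if len(decoded) >= n_chars:
            return decoded[:n_chars]
    except UnicodeDecodeError:
        # The error may lie beyond the text we actually need, so do not
        # report it here; let the non-pre-decoding path decide.
        pass
    return decode_exactly(buf, n_chars)
```

With this shape, read_text(b"ABCDEFGH\xff\xfe", 8) returns the eight
characters, while an invalid byte inside the first eight characters still
raises a decode error, so the optimisation does not change the outcome of
the parse.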
>
> Regards
>
> Steve Hanson
>
> IBM Hybrid Integration, Hursley, UK
> Architect, *IBM DFDL*
> <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
> Co-Chair, *OGF DFDL Working Group* <http://www.ogf.org/dfdl/>
> *smh at uk.ibm.com* <smh at uk.ibm.com>
> tel:+44-1962-815848
> mob:+44-7717-378890
> Note: I work Tuesday to Friday
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        DFDL-WG <dfdl-wg at ogf.org>
> Date:        08/10/2018 17:16
> Subject:        [DFDL-WG] clarification on behavior of DFDL
> encodingErrorPolicy='error' and pre-decoding by implementations
> Sent by:        "dfdl-wg" <dfdl-wg-bounces at ogf.org>
> ------------------------------
>
>
>
> The DFDL spec isn't clear on when encodingErrorPolicy 'error' is allowed
> to cause an error, and when one must be suppressed, if the implementation
> pre-decodes data into characters.
>
> Example:
>
> Suppose you have what turns out to be 8 characters of text, followed by
> some binary data.
>
> Suppose a DFDL implementation happens to always try to fill a buffer of 64
> decoded characters, just for efficiency reasons.
>
> Depending on what is in the binary data, this may parse the 8 characters
> of text without error, but subsequently hit a decode error, because it has
> strayed into binary data past the text.
>
> There is no actual decode error in the data stream, because parsing should
> determine there are only 8 characters of text, and then switch to parsing
> the binary data using binary means.
>
> The DFDL spec doesn't say this isn't allowed to cause a decode error.
> Perhaps it is implied somewhere? But I didn't find it.
>
> The DFDL spec does point out that for asserts/discriminators with testKind
> pattern, that pattern matching may cause decode errors. But again, suppose
> the regex matching library an implementation uses happens to pre-fetch and
> pre-decode a bunch of characters, but the regex matching library then finds
> a match that is quite short, and stops well before the characters that were
> pre-decoded that caused a decode error.
>
> It would seem to me that this sort of pre-decoding should not cause decode
> errors, but the DFDL spec doesn't state that explicitly.
>
> comments?