[DFDL-WG] clarification on behavior of DFDL encodingErrorPolicy='error' and pre-decoding by implementations

Steve Hanson smh at uk.ibm.com
Tue Oct 9 06:04:39 EDT 2018


This can be summarised by saying that a performance optimisation made by a 
DFDL implementation should not change a successful parse into a failure (or 
vice versa), nor should it change the DFDL infoset if the parse is 
successful. I think that goes without saying, but we could be explicit and 
add it somewhere.

If an implementation pre-decodes when parsing, then it needs to be sure 
that whatever it tries to decode a) does not go beyond the end of the data 
(possible with streaming input), and b) is legitimately in that encoding. 
If an implementation does some analysis of the schema and realises that 
the data will always be entirely UTF-8 text, then pre-decoding is a 
possible optimisation. If the data is a mixture of text and binary, then 
pre-decoding is not a possible optimisation unless there is also a 
fallback path, entered after a decode error, in which the implementation 
does not pre-decode.

Regards
 
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday 



From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     DFDL-WG <dfdl-wg at ogf.org>
Date:   08/10/2018 17:16
Subject:        [DFDL-WG] clarification on behavior of DFDL 
encodingErrorPolicy='error' and pre-decoding by implementations
Sent by:        "dfdl-wg" <dfdl-wg-bounces at ogf.org>



The DFDL spec isn't clear about when encodingErrorPolicy 'error' is allowed 
to cause an error, and when such an error must be suppressed, if the 
implementation pre-decodes data into characters.

Example:

Suppose you have what turns out to be 8 characters of text, followed by 
some binary data.

Suppose a DFDL implementation always tries to fill a buffer of 64 decoded 
characters, purely for efficiency reasons.

Depending on what is in the binary data, the implementation may decode the 
8 characters of text without error but then hit a decode error, because it 
has strayed into the binary data past the text.

There is no actual decode error in the data stream, because parsing should 
determine there are only 8 characters of text, and then switch to parsing 
the binary data using binary means.

The DFDL spec doesn't say that this is not allowed to cause a decode error. 
Perhaps it is implied somewhere, but I didn't find it.
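
For what it's worth, here is one hedged sketch (hypothetical Java using 
java.nio's CharsetDecoder; nothing below comes from the spec or from any 
particular implementation) of how the 64-character read-ahead could be 
kept without changing the outcome: the decode error found while filling 
the buffer is recorded, but only raised if the parser actually consumes 
characters up to that position. In the example above, only 8 characters 
are ever consumed before the parser switches to binary parsing, so the 
recorded error is never raised.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch only: a 64-character read-ahead that defers any
// decode error found while speculatively filling the buffer, surfacing
// it only if the parser actually consumes up to that character position.
final class DeferredErrorReadAhead {

    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

    private final CharBuffer chars = CharBuffer.allocate(64);
    private int errorAtChar = -1; // char index of the deferred error, or -1
    private int consumed = 0;     // characters handed to the parser so far

    // Speculatively fill the read-ahead buffer from the raw data.
    void fill(ByteBuffer rawData) {
        CoderResult result = decoder.decode(rawData, chars, false);
        if (result.isMalformed() || result.isUnmappable()) {
            // We strayed past the text into bytes that are not legal UTF-8.
            // Remember where, but do not report anything yet: those bytes
            // may be binary fields that are never decoded as text.
            errorAtChar = chars.position();
        }
        chars.flip();
    }

    // Hand one character to the parser; only here can a decode error
    // become real, because only here is that data actually read as text.
    char nextChar() {
        if (consumed == errorAtChar) {
            // With dfdl:encodingErrorPolicy='error' this is the point at
            // which a parse error should be raised.
            throw new IllegalStateException("decode error at character " + consumed);
        }
        consumed++;
        return chars.get();
    }
}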

The DFDL spec does point out that, for asserts/discriminators with testKind 
'pattern', the pattern matching may cause decode errors. But again, suppose 
the regex matching library an implementation uses happens to pre-fetch and 
pre-decode a bunch of characters, then finds a match that is quite short 
and stops well before the pre-decoded characters that caused the decode 
error.

It would seem to me that this sort of pre-decoding should not cause decode 
errors, but the DFDL spec doesn't state that explicitly.

comments?






Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are 
subject to the OGF Intellectual Property Policy

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU