[DFDL-WG] clarification on behavior of DFDL encodingErrorPolicy='error' and pre-decoding by implementations
Steve Hanson
smh at uk.ibm.com
Tue Oct 9 06:04:39 EDT 2018
This can be summarised by saying that a performance optimisation by a DFDL
implementation should not change a successful parse into a failure (and
vice versa), nor should it change the DFDL infoset if the parse is
successful. I think that goes without saying, but we could be explicit and
add it somewhere.
If an implementation is pre-decoding when parsing, then it needs to be
sure that whatever it tries to decode a) does not go beyond the end of the
data (possible with streaming input), and b) is legitimately in that
encoding. If an implementation does some analysis of the schema and
realises that the data will always be entirely UTF-8 text, then
pre-decoding is a possible optimisation. If the data is a mixture of text
and binary, then pre-decoding would not be a possible optimisation, unless
there was also a fallback path that the implementation dropped into after
a decode error, in which it did not pre-decode.
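The fallback strategy described above can be sketched roughly as follows. Python is used purely for illustration; the stream contents, function names, and buffer size are hypothetical and do not come from any DFDL implementation:

```python
import codecs

# Hypothetical stream: 8 characters of UTF-8 text followed by binary data
# that is not valid UTF-8 (0xFF is never a legal UTF-8 byte).
stream = b"8 chars!" + b"\xff\xfe\x00\x01"

def decode_eagerly(data, buffer_chars=64):
    """Naive optimisation: always try to bulk-decode up to 64 characters.
    On the stream above this raises UnicodeDecodeError even though the
    first 8 characters are perfectly good text."""
    return data.decode("utf-8")[:buffer_chars]

def decode_with_fallback(data, needed):
    """Bulk pre-decode on the fast path; on a decode error, fall back to
    decoding one byte at a time and stop just before the first
    undecodable byte, so an error surfaces only if the parser actually
    consumes that position."""
    try:
        return data.decode("utf-8")[:needed]           # fast path
    except UnicodeDecodeError:
        dec = codecs.getincrementaldecoder("utf-8")()  # slow fallback path
        out = ""
        for i in range(len(data)):
            try:
                out += dec.decode(data[i:i + 1])
            except UnicodeDecodeError:
                break                                  # reached binary data
            if len(out) >= needed:
                break
        return out[:needed]
```

With this stream, `decode_eagerly` turns a parse that should succeed into a decode error, while `decode_with_fallback(stream, 8)` returns the 8 characters of text and leaves the binary bytes for the binary parser.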
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: DFDL-WG <dfdl-wg at ogf.org>
Date: 08/10/2018 17:16
Subject: [DFDL-WG] clarification on behavior of DFDL
encodingErrorPolicy='error' and pre-decoding by implementations
Sent by: "dfdl-wg" <dfdl-wg-bounces at ogf.org>
The DFDL spec isn't clear on when encodingErrorPolicy 'error' is allowed
to cause an error, and when one must be suppressed, if the implementation
pre-decodes data into characters.
Example:
Suppose you have what turns out to be 8 characters of text, followed by
some binary data.
Suppose a DFDL implementation happens to always try to fill a buffer of 64
decoded characters, just for efficiency reasons.
Depending on what is in the binary data, this may parse the 8 characters
of text without error, but subsequently hit a decode error, because it has
strayed into binary data past the text.
There is no actual decode error in the data stream, because parsing should
determine there are only 8 characters of text, and then switch to parsing
the binary data using binary means.
The DFDL spec doesn't say this isn't allowed to cause a decode error.
Perhaps it is implied somewhere? But I didn't find it.
The DFDL spec does point out that for asserts/discriminators with testKind
'pattern', the pattern matching may cause decode errors. But again,
suppose the regex library an implementation uses happens to pre-fetch and
pre-decode a batch of characters, and the decode error arises among those
pre-decoded characters, yet the library then finds a match that is quite
short and stops well before the characters that caused the error.
It would seem to me that this sort of pre-decoding should not cause decode
errors, but the DFDL spec doesn't state that explicitly.
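One way an implementation could avoid surfacing such spurious errors, sketched here under assumed names (`match_without_spurious_errors` is hypothetical, not a real DFDL or Daffodil API), is to pre-decode permissively for the regex engine's lookahead and report a decode error only when undecodable bytes fall inside the matched span:

```python
import re

REPLACEMENT = "\ufffd"  # U+FFFD marks undecodable bytes under errors="replace"

def match_without_spurious_errors(pattern, data, encoding="utf-8"):
    """Pre-decode the whole buffer permissively so the regex engine may
    look arbitrarily far ahead, then treat a decode error as real only if
    a replacement character lands inside the matched text.  (Simplified:
    assumes genuine text never contains U+FFFD itself.)"""
    text = data.decode(encoding, errors="replace")
    m = re.match(pattern, text)
    if m is None:
        return None
    if REPLACEMENT in m.group(0):
        raise ValueError("decode error inside the matched text")
    return m.group(0)
```

Under this sketch a short match such as `match_without_spurious_errors(r"[a-z]+", b"abc\xff\xfe")` returns `"abc"`, even though the pre-decoded lookahead contained undecodable bytes.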
Comments?
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
--
dfdl-wg mailing list
dfdl-wg at ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg