[DFDL-WG] Action 242: units 'characters' and encodingErrorPolicy 'error' - valueLength and contentLength functions

Tue Jul 19 14:18:39 EDT 2016

In analyzing the valueLength and contentLength email discussion threads, I
find we have not converged on whether these functions, when measuring in
units 'characters', are allowed to compute the length without checking for
decode errors when (a) the encoding is fixed width, so a character unit is
just an alias for some number of bytes, and (b) encodingErrorPolicy='error'.

I think we need to clarify that encodingErrorPolicy being 'error' does not
mean that all data in its scope will be scanned to be sure there are no
decode errors.

Regex pattern asserts are another feature with a similar issue. In this case
the regex is matched against text, and that match might or might not
encounter a decode error, but the entire scope of data it applies to is NOT
going to be decoded just to ensure there is no chance of a decode error.
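
To make the regex case concrete, here is a minimal Python sketch (not taken
from any DFDL implementation) of a matcher that decodes strictly but only as
far as the pattern needs. The function name match_prefix_lazily, the chunk
size, and the sample data are all hypothetical, chosen for illustration only.

```python
import codecs
import re


def match_prefix_lazily(data: bytes, encoding: str, pattern: str, chunk_size: int = 4):
    """Match `pattern` at the start of `data`, decoding only as much as needed."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    regex = re.compile(pattern)
    text = ""
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        is_last = start + chunk_size >= len(data)
        # Strict decoding: a bad byte raises, but only if we get this far.
        text += decoder.decode(chunk, final=is_last)
        m = regex.match(text)
        # Simplification: treat a match that ends before the decoded text
        # does as final. That holds for the simple pattern used below.
        if m and m.end() < len(text):
            return m.group(0)
    m = regex.match(text)
    return m.group(0) if m else None


# Hypothetical data: an ASCII header followed by bytes that are not valid ASCII.
data = b"HDR;rest\xff\xfe"
matched = match_prefix_lazily(data, "ascii", r"[A-Z]+;")
print(matched)  # prints HDR;
```

The match succeeds and the undecodable bytes at the end are never visited,
whereas decoding the whole of `data` up front with the same strict policy
would raise an error.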

Can we go so far as to say DFDL implementations can (or maybe even must)
optimize performance by avoiding character decoding when possible? This
means that some character decode errors may not be detected even though
dfdl:encodingErrorPolicy is 'error'.

I would suggest the language should say that only character decoding that
results in a character being placed into the DFDL Infoset is guaranteed to
cause an error should that character not be decodable. (Similarly for
unparsing: only if we actually unparse an unmapped character from the
infoset to the output stream is an encoding error guaranteed to occur.)

I might not have worded it well, but I think the above is what we're trying
to allow: implementations are free to exploit fixed character width and
just jump around the bytes without decoding/encoding anything, whenever
they can, because we all expect, and think our users will expect, this
level of performance.

If you use UTF-8, certainly a common thing (or any other variable-width
encoding), then you are likely to hit some cases where implementations say
"can't do that with utf-8" simply because of a limitation of the
implementation. One may also get cases where switching from utf-16 to utf-8
for your data changes the behavior of the processor: with utf-16 the
processor wouldn't detect some decoding errors because of the fixed width,
whereas with utf-8 it has to measure length in characters by decoding and
so will detect the error.
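
A small Python sketch of that difference, assuming (as above) that utf-16 is
treated as fixed width at two bytes per character. The helper names
chars_fixed_width and chars_by_decoding and the sample bytes are hypothetical;
this only illustrates the behavior, it is not how any particular processor
works.

```python
def chars_fixed_width(data: bytes, bytes_per_char: int) -> int:
    """Length in characters from the byte count alone; nothing is decoded."""
    return len(data) // bytes_per_char


def chars_by_decoding(data: bytes, encoding: str) -> int:
    """Length in characters found by strictly decoding every byte."""
    return len(data.decode(encoding, errors="strict"))


# UTF-16LE: 'A', an unpaired high surrogate (U+D800), then 'B'.
utf16_data = b"A\x00\x00\xd8B\x00"
utf16_length = chars_fixed_width(utf16_data, 2)  # 3, bad code unit unnoticed

# UTF-8: 'A', the invalid byte 0xFF, then 'B'.
utf8_data = b"A\xffB"
try:
    chars_by_decoding(utf8_data, "utf-8")
    utf8_error_detected = False
except UnicodeDecodeError:
    utf8_error_detected = True

print(utf16_length, utf8_error_detected)  # prints 3 True
```

Both inputs are malformed, but only the variable-width path reports it,
because it is the only one that has to look at the characters to count them.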

Basically, we've tried to take a consistent position on a messy area:
the schema contains a complex type element which is a mixture of character
encodings and binary stuff, and which may have decode errors in the
corresponding data stream (or, for unparsing, characters in the infoset
that have no mapping into the representation of the encoding).

Given this mess, there are ways that a schema can look at it and, perhaps
foolishly, treat it as characters. Asserts with test patterns are one.
Specified length with units of 'characters' is another, and the
dfdl:contentLength and dfdl:valueLength functions are yet another, since one
can specify the units 'characters' as the second argument.

The only consistent position is that dfdl:contentLength or dfdl:valueLength
of an element, with units 'characters', does NOT necessarily imply that
those characters will be decoded/encoded.

Anyway, it's moot in the scenario where an earlier OVC wants the
dfdl:contentLength of something later, if we get the encoding error at the
point where we unparse that later thing. We just get what is arguably an
incorrect OVC computation, followed by a later encoding error.

Comments?

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>