[DFDL-WG] Action 217 - scannable - Re: clarification needed for scan and scannable

Mike Beckerle mbeckerle.dfdl at gmail.com
Tue Aug 27 20:31:15 EDT 2013


I've taken your errata doc as the current version, and will create 14.4 and
cover this issue as you suggest, i.e., by modifications to 2.119 and 3.9.

However, proper resolution of this scan/scannable-as-text issue is not yet
clear. As I have been trying to fix it I have encountered all sorts of
confusion.

Consider this:

For lengthKind='pattern', what we are trying to express is a recursive
property of the element declaration (the schema component) and all it
contains in the dynamic sense: everything lexically nested in it, as well
as everything it contains by way of references to elements, types, groups,
prefixLengthType, etc. The property ensures that it is meaningful to scan
over the entire extent of the data for the element using a single
character set encoding. Nothing with different assumptions about the
textual data is allowed to be contained therein (no binary anything, no
framing, no prefix lengths, etc., that disagree about encoding), so the
scan can just pass over all of it to determine the length.

The above ensures that no well-formed data stream can cause a decode error,
or a false match/non-match of the pattern due to data appearing falsely as
characters in the intended encoding when it is actually binary data or data
in some other encoding. If a decode error occurs, it must mean the data is
not well formed and encodingErrorPolicy then applies.
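The false-match hazard is easy to demonstrate outside DFDL; here is a
minimal Python sketch (illustrative only; the byte values are contrived):

```python
import re
import struct

# Two little-endian 16-bit integers -- binary data, not text.
data = struct.pack("<HH", 0x3231, 0x3433)   # raw bytes: 31 32 33 34

# Decoded with a permissive single-byte encoding, those bytes happen to
# look like the digit characters "1234", so a length-determining pattern
# like \d+ "matches" what is really binary data -- a false match.
text = data.decode("iso-8859-1")
m = re.match(r"\d+", text)
print(text, m.end())   # 1234 4
```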

Our current description of scannable-as-text tries to express the above
recursive property, but it is nowhere near complete enough. It doesn't
discuss, for example, that nothing can have a leadingSkip. Our current
description is also confused about whether it is talking about the data or
the schema component being used to parse the data.

For asserts/discriminators with testKind='pattern', the assumptions are
very different. The regex scan converts the data using the dfdl:encoding
that is in scope for the assert/discriminator. (This might be iso-8859-1,
whereas the actual content might be described by an element that isn't even
textual.) There is no requirement on the schema whatsoever. The requirement
is that the data decode properly in the dfdl:encoding of the
assert/discriminator, and even then, errors are handled by a specific
mechanism.

The difference here: lengthKind 'pattern' must traverse the whole object
to determine its length, whereas a testKind pattern potentially just peeks
at the start of it.
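In regex terms the contrast looks like this (a Python sketch of the two
semantics, not of any particular DFDL implementation; the data and
patterns are made up):

```python
import re

stream = "HDR:abc;more-data-follows"

# lengthKind='pattern': the regex must consume the element's entire
# extent, and the length of the match *is* the element's length.
m = re.match(r"HDR:[a-z]+;", stream)
length = m.end()                      # 8 characters consumed

# testKind='pattern': only peek at the front; all that matters is
# whether a match exists, not how far it reaches.
looks_right = re.match(r"HDR:", stream) is not None

print(length, looks_right)   # 8 True
```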

So, it is back to the drawing board on this, which affects the glossary,
7.3.1 and 7.3.2, 12.3.5, and Erratum 2.119 and Erratum 3.9.

I believe we need two entirely different descriptions, and to me they don't
belong in the glossary, but in the specific sections on lengthKind pattern,
and on testKind pattern respectively. The definitions are not compact
enough to make sense in the glossary.

We should consider dropping this restriction on lengthKind='pattern' and
allowing it to be used over any sort of data with the exact same caveats
as testKind='pattern', which is to say: user beware about what is inside
something with lengthKind='pattern', and if there are decode errors, then
dfdl:encodingErrorPolicy applies and you either get a PE or a substitution
character (which probably, but not necessarily, means no match, hence
length 0).

I think this restriction was my idea, but given the complexity it takes to
describe it, I am seriously questioning whether it is worth it, and whether
a weaker semantics is sufficient.





On Tue, Aug 27, 2013 at 6:53 AM, Steve Hanson <smh at uk.ibm.com> wrote:

> Mike
>
> I think this is best handled by updating existing errata 2.119, and
> updating 3.9 (not 2.9).
>
> I've attached an updated errata document that has corrected some typos
> etc, but which does not address the above.
>
>
>
> Regards
>
> Steve Hanson
> Architect, IBM Data Format Description Language (DFDL)
> Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> IBM SWG, Hursley, UK
> smh at uk.ibm.com
> tel:+44-1962-815848
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        Steve Hanson/UK/IBM at IBMGB,
> Cc:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>
> Date:        24/08/2013 18:34
> Subject:        Action 217 - scannable - Re: [DFDL-WG] clarification
> needed for scan and scannable
> ------------------------------
>
>
>
> I propose this new erratum, and corresponding edits to erratum 2.9.
>
> In the draft r14.3 (to be circulated soon), I have OPEN comment bubbles to
> review the impact of this change, but I have edited this stuff in, as you
> really have to see it in action to see if it "works for you".
>
> *Errata 2.155* Sections 3, 7.3.1, 7.3.2, 12.3.5. Scan, scannable,
> scannable-as-text
>
> These terms all added to/changed in the glossary. Definitions removed from
> the prose. Scannable now means able to scan, which is natural. More
> specific term scannable-as-text used when we want the recursive requirement
> of uniform encoding.
>
> Errata 2.9 updated to use term scannable-as-text.
>
>
>
>
> On Wed, Aug 14, 2013 at 5:45 AM, Steve Hanson <smh at uk.ibm.com> wrote:
> Action 217 raised to decide new terminology for the regex scanning
> encoding requirement.
>
> Regards
>
> Steve Hanson
> Architect, IBM Data Format Description Language (DFDL)
> Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> IBM SWG, Hursley, UK
> smh at uk.ibm.com
> tel: +44-1962-815848
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        Tim Kimber/UK/IBM at IBMGB,
> Cc:        Steve Hanson/UK/IBM at IBMGB
> Date:        30/07/2013 01:01
> Subject:        Re: clarification needed for scan and scannable
>  ------------------------------
>
>
>
> I reviewed the spec. I have to reverse my prior statement. What you
> describe is definitely currently allowed by the spec. The 'scannable'
> restriction is reserved for lengthKind='pattern'.
>
> In my view this is a terrible idea, but it is where it is; i.e., the
> lengthKind 'delimited' doesn't require the 'scannable' characteristic
> throughout everything that is contained within it. It will not allow us to
> really assist users in identifying what are very likely to be mistakes in
> their schemas, because what we call 'delimited' is much too permissive.
>
> I am not going to argue to change this at the current time. This is just
> due to the need to converge on the standard. My preference would be that
> lengthKind 'delimited' requires scannability (as uniform text), and that
> some new keyword be used to mean the current algorithm as specified.
>
> I do suggest that we rename the characteristic scannability to
> scannable-as-text, in order to make it clearer what the requirement is, and
> clarify that this is a requirement on the schema component.
>
> I suspect that lengthKind='delimited' will perhaps someday (DFDL v2.0?) be
> deprecated and/or augmented by other keywords that are more specific such
> as "delimitedText" which means scanning for delimiters over only text in
> one encoding (the 'scannable' restriction), and other keywords like
> "delimitedBinary" or "delimitedMixed" meaning formats which admit the more
> powerful and complex things.
>
> My argument is simple: if people have something like delimiters in
> distinct encodings appearing in their schema, it is most likely due to a
> human error (they missed one place when they were editing the encoding
> property), rather than something as obscure as
> delimiters-in-a-different-encoding being a real characteristic of the
> data. An SDE here will help them find this error.
>
> Furthermore, if you truly want a string to be terminated by either an
> ebcdic comma (6B) or an ascii linefeed (0A), then you have a few
> alternatives before you have to go to the now-optional feature of
> 'rawbyte', all of which are consistent with scannability.
>
> First, use encoding ascii, and specify terminators of %LF;, and 'k' (or
> %#x6B; since 'k' is 6B ascii), in which case the string is assumed to
> contain only ascii code units. In that case, if ebcdic 'a' (x81 - illegal
> in ascii) is encountered in the data, you will either get an error or a
> unicode replacement character depending on encodingErrorPolicy. If you
> know your data will contain only ascii-legal code units, then this is a
> good solution.
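That behavior is easy to check in Python (EBCDIC lowercase 'a' is byte
0x81 in the standard code pages; Python's errors='replace' stands in for
encodingErrorPolicy 'replace'):

```python
# Byte 0x81 (EBCDIC 'a') is illegal as an ASCII code unit.
raw = b"ok\x81"

try:
    raw.decode("ascii")
    outcome = "decoded"
except UnicodeDecodeError:
    outcome = "error"        # analogous to encodingErrorPolicy 'error'

# Analogous to encodingErrorPolicy 'replace': U+FFFD is substituted.
replaced = raw.decode("ascii", errors="replace")
print(outcome, replaced)     # error ok\ufffd
```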
>
> I would note that your example (previously in the thread) did not specify
> an encoding for the innermost string elements. Did you intend for those to
> be ebcdic or ascii?
>
> Alternatively you can specify encoding ebcdic-cp-us, specify terminator of
> comma (codepoint x6B) and %#x0A; which is the RPT control code in ebcdic
> but not an illegal character. In that case the string can contain any legal
> ebcdic code point. However, if code unit x70 is encountered (corresponds to
> an ascii 'p', but unmapped in ebcdic-cp-us), you will either get an error,
> or a unicode replacement character depending on encodingErrorPolicy. If you
> know your data will contain only legal ebcdic code units, then this is a
> good solution.
>
> Finally, you can specify encoding iso-8859-1, and terminators %#x6B; (or
> 'k' which is 6B in iso-8859-1) and %#x0A; (linefeed in iso-8859-1). Then
> any code units at all will be legal as all bytes have corresponding
> character codepoints in iso-8859-1. If you have no idea what the data is,
> but just want some sort of string out of it, and know only that the
> terminators are these bytes, then this is a good solution. If your data
> contains, for example, packed-decimal numbers, then this is the only way to
> safely scan past it as a string, because both ebcdic and ascii have
> unmapped code points from the code units that could appear in
> packed-decimal number bytes.
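The iso-8859-1 property is easy to verify in Python: every one of the 256
byte values decodes, whereas ascii rejects anything with the high bit set.

```python
every_byte = bytes(range(256))

# iso-8859-1 maps byte values 0x00..0xFF directly to codepoints
# U+0000..U+00FF, so no input whatsoever can cause a decode error.
text = every_byte.decode("iso-8859-1")

# ascii, by contrast, fails on the first byte >= 0x80.
try:
    every_byte.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(len(text), ascii_ok)   # 256 False
```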
>
> All the above are consistent with implementations that attempt to convert
> data from code-units (bytes) to Unicode codepoints using a character set
> decoder, and then apply any scanning/searching only in the Unicode realm,
> and work in one pass over the data.
>
> I currently think that implementing both rawbytes and encodingErrorPolicy
> will require two passes over the data. I expect this overhead to be
> negligible, so I'm not really worried about it.
>
>
> ...mike
>
> On Mon, Jul 29, 2013 at 7:18 AM, Tim Kimber <KIMBERT at uk.ibm.com> wrote:
> Hi Mike,
>
> Good - that was almost exactly the reply that I was expecting. I now
> understand exactly where you are coming from, and how we arrived at this
> position.
>
> First, a few statements that I think are true. I want to establish some
> basic ground rules before we decide how to go forward:
> a) it is desirable for DFDL ( more accurately, a valid subset of DFDL ) to
> be implementable using well known parsing techniques. I think that pretty
> much means regular expressions and BNF-style grammars. That implies that it
> might be possible to implement a DFDL parser using one of the well-known
> parser-generator technologies like Yacc/Bison/JavaCC/Antlr. I'm not
> claiming that it *is* possible, but I think it would be a good thing if it
> was.
> b) It is technically feasible to implement a DFDL parser that can handle
> the mixed-encoding example using regex technology. However, it would not be
> a very efficient implementation because the scanner would have to scan the
> data once per encoding, and it would have to do that for every character
> after the end of field1.
> c) It is possible to produce an efficient scanner that handles mixed
> encodings. Such a scanner cannot use regex technology for scanning - in
> fact, I think the only efficient implementation is to convert all
> terminating markup into byte sequences and then perform all scanning in the
> byte domain. This is what the IBM implementation does.
>
> The scenario in my example is not entirely far-fetched - it is conceivable
> that the encoding might change mid-way through a document, and I think
> Steve came up with a real-world format that required this ( I have a hazy
> memory of discussing this a couple of years ago ). The requirement for the
> raw byte entity is not for this case - it is for the case where the
> delimiters are genuinely not characters ( e.g. a UTF-16 data stream
> terminated by a single null byte ). However, it is not easy to come up with
> realistic examples where the raw byte entity could not be translated into a
> character before the processor uses it. I think that's where some of the
> confusion has arisen.
>
> We have already agreed to make the raw byte entity an optional feature. We
> should consider disallowing mixed encodings when lengthKind is delimited.
> If we cannot do that then I agree with Mike that the descriptions and
> definitions in the spec need to be made a bit clearer.
>
> regards,
>
> Tim Kimber, DFDL Team,
> Hursley, UK
> Internet: kimbert at uk.ibm.com
> Tel. 01962-816742
> Internal tel. 37246742
>
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        Tim Kimber/UK/IBM at IBMGB,
> Cc:        Steve Hanson/UK/IBM at IBMGB
> Date:        26/07/2013 22:49
> Subject:        Re: clarification needed for scan and scannable
>  ------------------------------
>
>
>
> Clearly we have to rework the description and definitions because my
> understanding of the current spec would say your schema is not valid
> because it is not scannable. Everything in the scope of the enclosing
> element which is delimited must be scannable and that means uniform
> encoding.
> It is exactly to rule out this mixed encoding ambiguity that we have the
> restriction.
> I have no idea how to implement delimiters without this restriction.
>
> I think the case you have here is what raw bytes are for. Though I am no
> longer clear on how to implement rawbytes either.
>
> On Jul 25, 2013 6:40 PM, "Tim Kimber" <KIMBERT at uk.ibm.com> wrote:
> Suppose I have this DFDL XSD:
>
> <xs:element name="documentRoot" dfdl:encoding="US-ASCII"
>             dfdl:lengthKind="delimited" dfdl:terminator="]]]">
>   <xs:complexType>
>     <xs:sequence>
>       <xs:element name="delimitedRecord" maxOccurs="unbounded"
>                   dfdl:encoding="US-ASCII" dfdl:lengthKind="delimited"
>                   dfdl:terminator="%LF;">
>         <xs:complexType>
>           <xs:sequence dfdl:separator=","
>                        dfdl:separatorSuppressionPolicy="suppressedAtEndLax"
>                        dfdl:encoding="EBCDIC-US">
>             <xs:element name="field1" type="xs:string"
>                         dfdl:lengthKind="delimited"/>
>             <xs:element name="field2" type="xs:string"
>                         dfdl:lengthKind="delimited" minOccurs="0"/>
>             <xs:element name="field3" type="xs:string"
>                         dfdl:lengthKind="delimited" minOccurs="0"/>
>           </xs:sequence>
>         </xs:complexType>
>       </xs:element>
>     </xs:sequence>
>   </xs:complexType>
> </xs:element>
>
>
> ...which will parse some data like this:
> field1Value,field2Value,field3Value
> field1Value,
> field1Value,field2Value]]]
> ...except that the commas that delimit the fields are in EBCDIC and the
> linefeeds that delimit the records are in ASCII.
>
> I have purposely constructed this example with separatorSuppressionPolicy
> set to 'suppressedAtEndLax'. This means that field1 and field2 could both
> be terminated by either of
> a) an EBCDIC comma or
> b) an ASCII line feed
>
> This is a very artificial example, but it is valid DFDL and should not
> produce a schema definition error. If we use the term 'scan' when discussing
> lengthKind='delimited' then we need to be careful how to define the term
> 'scannable' - otherwise we might appear to prohibit things that are
> actually valid. I think this is the same point that Steve was making.
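The pitfall for a single-encoding scanner is concrete here: the EBCDIC
comma byte (0x6B) is 'k' in ASCII. A Python sketch with a made-up record:

```python
# "field1Value", then an EBCDIC comma (0x6B), then an ASCII linefeed (0x0A).
record = b"field1Value\x6bfield2Value\x0a"

# A scanner locked to ASCII for the whole pass decodes the EBCDIC comma
# as the letter 'k' -- it finds no ',' separator at all.
as_ascii = record.decode("ascii")
print("," in as_ascii, "k" in as_ascii)   # False True
```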
>
> Most implementations will do the same as Daffodil and will use a regex
> engine to implement all types of scanning, including
> lengthKind='delimited'. It's the most natural solution, and it can be made
> to work as long as
> a) the implementation takes extra care if/when the encoding changes within
> a component and
> b) the implementation either does not support the raw byte entity, or only
> supports it when it can be translated to a valid character in the
> component's encoding.
>
> There is huge scope for confusion in this area - most readers of the DFDL
> specification will assume that delimited scanning can be implemented using
> regex technology + EBNF grammar. It may be possible (I'm not claiming that
> it is, btw), but the grammar would be decidedly non-trivial, so the DFDL
> specification should take extra care to be unambiguous when discussing how
> to scan for terminating markup.
>
> regards,
>
> Tim Kimber, DFDL Team,
> Hursley, UK
> Internet: kimbert at uk.ibm.com
> Tel. 01962-816742
> Internal tel. 37246742
>
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        Steve Hanson/UK/IBM at IBMGB,
> Cc:        Tim Kimber/UK/IBM at IBMGB
> Date:        25/07/2013 19:04
> Subject:        clarification needed for scan and scannable
> ------------------------------
>
>
>
>
>  If a user has read the Glossary and noted the definition of 'scannable'
> then when he sees the terms 'scanning' and 'scanned' in 12.3.2 he may think
> that this implies data being read by lengthKind must be 'scannable', and it
> is not so.  That's how I read the spec, hence my original comment. I find
> it confusing to define 'scan' and then not to define  'scannable' as 'able
> to be *scan*ned'.
> OK, so I am rethinking how we express what we mean by scan.
>
> scan - verb. The action of attempting to find something in characters.
> Implies decoding the characters from code units to character codes.
>
> A scan can succeed on real data, or fail (no match, not found), but both
> are normal behaviors for a scan.
>
> However, this is predicated on the data at least being meaningfully
> decoded into characters of a single character set encoding. This is because
> our regex technology isn't expected to be able to shift encodings mid
> match/scan, nor is it expected to be able to jump over/around things that
> aren't textual so as not to misinterpret them as code units of characters.
>
> So saying a schema component is scannable - means the schema expresses the
> assumption that the data is expected to be all textual/characters in a
> uniform encoding so that it is meaningful to talk about scanning it with
> the kind of regex technology we have available today. That is, there's no
> need to consider encoding changes mid match, nor what data to jump
> over/around.
>
> This is the sense of "scan"-ability that we mean by the term. Our
> expectation, as expressed in the schema, is that so long as the data
> decodes without error, the scan will either succeed or fail normally.
>
> Actual data is scannable if it meets this expectation.
>
> Contrast this with what happens for an assert with testKind='pattern'. In
> that case we scan, but DFDL processors aren't required to provide any
> guarantee that the data is scannable, which means the designer of that
> regex pattern must really know the data, and know what they are doing, so
> as to avoid the issues of decode errors.
>
> Why do we do this? Because we don't want to have to refactor a complex
> type element into two sequences, one of which is the stuff at the front
> that an assert with a test pattern examines, and the other of which is
> the stuff after that. In some situations it may be very awkward or
> impossible to separate out the part at the front that we need to
> scrutinize. Instead we just allow you to put the assert at the
> beginning, and it is up to the regex designer to know that the first few
> fields whose data it examines are scannable (in the sense of the
> schema's expectation that they are text), or, if not scannable in terms
> of that expectation, then at least that their actual contents will not
> hold data that, interpreted as character code points, causes decode
> errors. In other words, the writer of a test pattern has to know either
> that the data is scannable (the schema says so), or that the actual data
> will in fact be scannable, in that decode errors won't occur when it is
> decoded.
>
> This also provides a backdoor by which one can use regex technology to
> match against binary data. But then you really have to understand the
> data, so as to know exactly what code units will appear in it and what
> the regex technology will do with them. One can even play games with
> encoding="iso-8859-1", where no decoding errors are possible, so as to
> guarantee no decode errors when scanning binary data that really
> shouldn't be thought of as text.
>
> The part of the data that a test pattern of an assert actually looks at
> needs to be scannable by schema expectation, or scannable in actual data
> contents.
>
> If you write a test pattern, expecting to match things in what will turn
> out to be the first and second elements of a complex type, but the encoding
> is different for the second element, then your test pattern isn't going to
> work (you may get false matches, false non-matches, or decode errors)
> because the assumption that the data is scannable for the first and
> second elements is violated.
>
>  Summary: can we just say
>
> scan - verb
> scannable - a schema component is scannable if ... current definition.
> Data is scannable with respect to a specific character set encoding if,
> when it is interpreted as a sequence of code units, they always decode
> without error.
>
> There are two ways to make "any data" scannable for certain. One - set
> encodingErrorPolicy to 'replace'. Now all data will decode without error
> because you get substitution characters instead.
>
> Two - change encoding to iso-8859-1 or other supported encoding where
> every byte is a legal code unit.
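Both escape hatches, sketched in Python (Python's errors='replace'
standing in for encodingErrorPolicy 'replace'; the input bytes are
arbitrary):

```python
arbitrary = bytes([0x00, 0x81, 0xFF, 0x41])

# Way One: keep the strict encoding but substitute on error.
one = arbitrary.decode("ascii", errors="replace")   # '\x00\ufffd\ufffdA'

# Way Two: switch to an encoding in which every byte is a legal code unit.
two = arbitrary.decode("iso-8859-1")                # '\x00\x81\xffA'

# Either way, every input byte yields exactly one character.
print(len(one) == len(two) == len(arbitrary))   # True
```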
>
> If you are in neither situation One nor Two above, then you have to say
> the data is scannable with respect to a specific encoding. So saying the
> data is scannable as ASCII means that it will not contain any bytes that,
> interpreted as ASCII code units, are illegal. For this example, that
> means all the bytes have values from 0 to 127 (high bit not set). If you
> know that will be true of your data, then even if it is binary data you
> can say it is scannable as ASCII.
>
> The representation of packed decimal numbers is not scannable as ascii,
> for example, because any high-nibble digit of 8 or greater causes the
> byte containing it to have a value of 128 or higher, for which there is
> no corresponding ascii code unit.
>
> Packed Decimal numbers are also not scannable as UTF-8, because many
> packed decimal bytes will not be legal utf-8 code unit values, nor will
> they be in the ordered arrangements UTF-8 requires.
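A Python check, using a hypothetical packed-decimal value (digits 9, 8, 7
plus a positive sign nibble 0xC):

```python
# Packed-decimal +987: nibbles 9,8 in the first byte, 7,C in the second.
packed = bytes([0x98, 0x7C])

def decodes(raw, enc):
    """True if raw decodes cleanly in the given encoding."""
    try:
        raw.decode(enc)
        return True
    except UnicodeDecodeError:
        return False

results = (decodes(packed, "ascii"),       # 0x98 has the high bit set
           decodes(packed, "utf-8"),       # 0x98 is a stray continuation byte
           decodes(packed, "iso-8859-1"))  # every byte maps to a codepoint
print(results)   # (False, False, True)
```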
>
>
>
>
>
>
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>
>
>
>
>
> --
> Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
> www.tresys.com
>
>
>
>
> --
>   dfdl-wg mailing list
>   dfdl-wg at ogf.org
>   https://www.ogf.org/mailman/listinfo/dfdl-wg
>
>
>
> --
> Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
> www.tresys.com
>
>
>
>


-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com

