[DFDL-WG] Action 217 - scannable - Re: clarification needed for scan and scannable

Mike Beckerle mbeckerle.dfdl at gmail.com
Sat Aug 24 13:34:46 EDT 2013


I propose this new erratum, and corresponding edits to erratum 2.9.

In the draft r14.3 (to be circulated soon), I have OPEN comment bubbles to
review the impact of this change, but I have edited the changes in, as you
really have to see them in action to see whether they "work for you".

Errata 2.155: Sections 3, 7.3.1, 7.3.2, 12.3.5. Scan, scannable,
scannable-as-text

These terms are all added to or changed in the glossary, and their
definitions are removed from the prose. 'Scannable' now means able to be
scanned, which is natural. The more specific term 'scannable-as-text' is
used when we want the recursive requirement of uniform encoding.

Errata 2.9 is updated to use the term 'scannable-as-text'.



On Wed, Aug 14, 2013 at 5:45 AM, Steve Hanson <smh at uk.ibm.com> wrote:

> Action 217 raised to decide new terminology for the regex scanning
> encoding requirement.
>
> Regards
>
> Steve Hanson
> Architect, IBM Data Format Description Language (DFDL)
> Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> IBM SWG, Hursley, UK
> smh at uk.ibm.com
> tel: +44-1962-815848
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        Tim Kimber/UK/IBM at IBMGB,
> Cc:        Steve Hanson/UK/IBM at IBMGB
> Date:        30/07/2013 01:01
> Subject:        Re: clarification needed for scan and scannable
> ------------------------------
>
>
>
> I reviewed the spec. I have to reverse my prior statement. What you
> describe is definitely currently allowed by the spec. The 'scannable'
> restriction is reserved for lengthKind='pattern'.
>
> In my view this is a terrible idea, but it is where it is: lengthKind
> 'delimited' does not require the 'scannable' characteristic throughout
> everything contained within it. This will not let us really assist users
> in identifying what are very likely to be mistakes in their schemas,
> because what we call 'delimited' is much too permissive.
>
> I am not going to argue to change this at the current time. This is just
> due to the need to converge on the standard. My preference would be that
> lengthKind 'delimited' requires scannability (as uniform text), and that
> some new keyword be used to mean the current algorithm as specified.
>
> I do suggest that we rename the 'scannable' characteristic to
> 'scannable-as-text', to make it clearer what the requirement is, and to
> clarify that it is a requirement on the schema component.
>
> I suspect that lengthKind='delimited' will perhaps someday (DFDL v2.0?) be
> deprecated and/or augmented by other keywords that are more specific such
> as "delimitedText" which means scanning for delimiters over only text in
> one encoding (the 'scannable' restriction), and other keywords like
> "delimitedBinary" or "delimitedMixed" meaning formats which admit the more
> powerful and complex things.
>
> My argument is simple: if people have something like delimiters in
> distinct encodings appearing in their schema, it is most likely due to a
> human error (they missed one place when they were editing the encoding
> property), rather than something as obscure as
> delimiters-in-a-different-encoding being a real characteristic of the
> data. An SDE (schema definition error) here will help them find the
> mistake.
>
> Furthermore, if you truly want a string to be terminated by either an
> ebcdic comma (6B) or an ascii linefeed (0A), then you have a few
> alternatives, all consistent with scannability, before you have to resort
> to the now-optional 'rawbyte' feature.
>
> First, use encoding ascii, and specify terminators of %LF; and 'k' (or
> %#x6B;, since 'k' is 6B in ascii), in which case the string is assumed to
> contain only ascii code units. In that case, if ebcdic 'a' (x81, illegal
> in ascii) is encountered in the data, you will get either an error or a
> unicode replacement character, depending on encodingErrorPolicy. If you
> know your data will contain only ascii-legal code units, then this is a
> good solution.
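>
> Here is a minimal sketch of that first alternative (the element name is
> hypothetical; the properties are the real DFDL ones discussed above):
>
> <xs:element name="record" type="xs:string"
>             dfdl:encoding="US-ASCII"
>             dfdl:lengthKind="delimited"
>             dfdl:terminator="%LF; k"
>             dfdl:encodingErrorPolicy="error"/>
>
> <!-- with dfdl:encodingErrorPolicy="replace", an ebcdic 'a' (x81) would
>      become a unicode replacement character instead of raising an error -->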
>
> I would note that your example (previously in the thread) did not specify
> an encoding for the innermost string elements. Did you intend for those to
> be ebcdic or ascii?
>
> Alternatively, you can specify encoding ebcdic-cp-us with terminators of
> comma (codepoint x6B) and %#x0A;, which is the RPT control code in ebcdic
> but not an illegal character. In that case the string can contain any
> legal ebcdic code point. However, if code unit x70 is encountered (it
> corresponds to an ascii 'p', but is unmapped in ebcdic-cp-us), you will
> get either an error or a unicode replacement character, depending on
> encodingErrorPolicy. If you know your data will contain only legal ebcdic
> code units, then this is a good solution.
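>
> A corresponding sketch (again the element name is hypothetical, mirroring
> the description above):
>
> <xs:element name="record" type="xs:string"
>             dfdl:encoding="ebcdic-cp-us"
>             dfdl:lengthKind="delimited"
>             dfdl:terminator=", %#x0A;"/>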
>
> Finally, you can specify encoding iso-8859-1, and terminators %#x6B; (or
> 'k', which is 6B in iso-8859-1) and %#x0A; (linefeed in iso-8859-1). Then
> any code units at all will be legal, as all bytes have corresponding
> character codepoints in iso-8859-1. If you have no idea what the data is,
> but just want some sort of string out of it, and know only that the
> terminators are these bytes, then this is a good solution. If your data
> contains, for example, packed-decimal numbers, then this is the only way
> to safely scan past it as a string, because both ebcdic and ascii leave
> unmapped some of the code units that can appear in packed-decimal bytes.
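>
> And a sketch of this last alternative (hypothetical element name):
>
> <xs:element name="record" type="xs:string"
>             dfdl:encoding="ISO-8859-1"
>             dfdl:lengthKind="delimited"
>             dfdl:terminator="%#x6B; %#x0A;"/>
>
> <!-- every byte maps to some iso-8859-1 character, so decode errors are
>      impossible no matter what the data contains -->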
>
> All the above are consistent with implementations that attempt to convert
> data from code-units (bytes) to Unicode codepoints using a character set
> decoder, and then apply any scanning/searching only in the Unicode realm,
> and work in one pass over the data.
>
> I currently think that implementing both rawbytes and encodingErrorPolicy
> will require two passes over the data. I expect the overhead to be
> negligible, so I'm not really worried about it.
>
>
> ...mike
>
> On Mon, Jul 29, 2013 at 7:18 AM, Tim Kimber <KIMBERT at uk.ibm.com>
> wrote:
> Hi Mike,
>
> Good - that was almost exactly the reply that I was expecting. I now
> understand exactly where you are coming from, and how we arrived at this
> position.
>
> First, a few statements that I think are true. I want to establish some
> basic ground rules before we decide how to go forward:
> a) it is desirable for DFDL (more accurately, a valid subset of DFDL) to
> be implementable using well-known parsing techniques. I think that pretty
> much means regular expressions and BNF-style grammars. That implies that
> it might be possible to implement a DFDL parser using one of the
> well-known parser-generator technologies like Yacc/Bison/JavaCC/Antlr. I'm
> not claiming that it *is* possible, but I think it would be a good thing
> if it was.
> b) It is technically feasible to implement a DFDL parser that can handle
> the mixed-encoding example using regex technology. However, it would not be
> a very efficient implementation because the scanner would have to scan the
> data once per encoding, and it would have to do that for every character
> after the end of field1.
> c) It is possible to produce an efficient scanner that handles mixed
> encodings. Such a scanner cannot use regex technology for scanning - in
> fact, I think the only efficient implementation is to convert all
> terminating markup into byte sequences and then perform all scanning in the
> byte domain. This is what the IBM implementation does.
>
> The scenario in my example is not entirely far-fetched - it is conceivable
> that the encoding might change mid-way through a document, and I think
> Steve came up with a real-world format that required this (I have a hazy
> memory of discussing this a couple of years ago). The requirement for the
> raw byte entity is not for this case - it is for the case where the
> delimiters are genuinely not characters (e.g. a UTF-16 data stream
> terminated by a single null byte). However, it is not easy to come up with
> realistic examples where the raw byte entity could not be translated into
> a character before the processor uses it. I think that's where some of the
> confusion has arisen.
>
> We have already agreed to make the raw byte entity an optional feature. We
> should consider disallowing mixed encodings when lengthKind is delimited.
> If we cannot do that then I agree with Mike that the descriptions and
> definitions in the spec need to be made a bit clearer.
>
> regards,
>
> Tim Kimber, DFDL Team,
> Hursley, UK
> Internet: kimbert at uk.ibm.com
> Tel. 01962-816742
> Internal tel. 37246742
>
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        Tim Kimber/UK/IBM at IBMGB,
> Cc:        Steve Hanson/UK/IBM at IBMGB
> Date:        26/07/2013 22:49
> Subject:        Re: clarification needed for scan and scannable
>  ------------------------------
>
>
>
> Clearly we have to rework the description and definitions, because my
> understanding of the current spec would say your schema is not valid: it
> is not scannable. Everything in the scope of an enclosing delimited
> element must be scannable, and that means uniform encoding. It is exactly
> to rule out this mixed-encoding ambiguity that we have the restriction.
>
> I have no idea how to implement delimiters without this restriction.
>
> I think the case you have here is what raw bytes are for. Though I am no
> longer clear on how to implement rawbytes either.
>
> On Jul 25, 2013 6:40 PM, "Tim Kimber" <KIMBERT at uk.ibm.com> wrote:
> Suppose I have this DFDL XSD:
>
> <xs:element name="documentRoot" dfdl:encoding="US-ASCII"
>             dfdl:lengthKind="delimited" dfdl:terminator="]]]">
>   <xs:complexType>
>     <xs:sequence>
>       <xs:element name="delimitedRecord" maxOccurs="unbounded"
>                   dfdl:encoding="US-ASCII" dfdl:lengthKind="delimited"
>                   dfdl:terminator="%LF;">
>         <xs:complexType>
>           <xs:sequence dfdl:separator=","
>                        dfdl:separatorSuppressionPolicy="suppressedAtEndLax"
>                        dfdl:encoding="EBCDIC-US">
>             <xs:element name="field1" type="xs:string"
>                         dfdl:lengthKind="delimited"/>
>             <xs:element name="field2" type="xs:string"
>                         dfdl:lengthKind="delimited" minOccurs="0"/>
>             <xs:element name="field3" type="xs:string"
>                         dfdl:lengthKind="delimited" minOccurs="0"/>
>           </xs:sequence>
>         </xs:complexType>
>       </xs:element>
>     </xs:sequence>
>   </xs:complexType>
> </xs:element>
>
>
> ...which will parse some data like this:
> field1Value,field2Value,field3Value
> field1Value,
> field1Value,field2Value]]]
> ...except that the commas that delimit the fields are in EBCDIC, and the
> linefeeds that delimit the records are in ASCII.
>
> I have purposely constructed this example with separatorSuppressionPolicy
> set to 'suppressedAtEndLax'. This means that field1 and field2 could both
> be terminated by either of
> a) an EBCDIC comma or
> b) an ASCII line feed
>
> This is a very artificial example, but it is valid DFDL and should not
> produce a schema definition error. If we use the term 'scan' when
> discussing lengthKind='delimited' then we need to be careful how we define
> the term 'scannable' - otherwise we might appear to prohibit things that
> are actually valid. I think this is the same point that Steve was making.
>
> Most implementations will do the same as Daffodil and will use a regex
> engine to implement all types of scanning, including
> lengthKind='delimited'. It's the most natural solution, and it can be made
> to work as long as
> a) the implementation takes extra care if/when the encoding changes within
> a component and
> b) the implementation either does not support the raw byte entity, or only
> supports it when it can be translated to a valid character in the
> component's encoding.
>
> There is huge scope for confusion in this area - most readers of the DFDL
> specification will assume that delimited scanning can be implemented using
> regex technology + EBNF grammar. It may be possible (I'm not claiming that
> it is, btw), but the grammar would be decidedly non-trivial, so the DFDL
> specification should take extra care to be unambiguous when discussing how
> to scan for terminating markup.
>
> regards,
>
> Tim Kimber, DFDL Team,
> Hursley, UK
> Internet: kimbert at uk.ibm.com
> Tel. 01962-816742
> Internal tel. 37246742
>
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        Steve Hanson/UK/IBM at IBMGB,
> Cc:        Tim Kimber/UK/IBM at IBMGB
> Date:        25/07/2013 19:04
> Subject:        clarification needed for scan and scannable
> ------------------------------
>
>
>
>
>  If a user has read the Glossary and noted the definition of 'scannable',
> then when he sees the terms 'scanning' and 'scanned' in 12.3.2 he may
> think this implies that data being read by that lengthKind must be
> 'scannable', and it is not so. That's how I read the spec, hence my
> original comment. I find it confusing to define 'scan' and then not to
> define 'scannable' as 'able to be *scan*ned'.
>
> OK, so I am rethinking how we express what we mean by scan.
>
> scan - verb. The action of attempting to find something in characters.
> Implies decoding the characters from code units to character codes.
>
> A scan can succeed on real data, or fail (no match, not found), but both
> are normal behaviors for a scan.
>
> However, this is predicated on the data at least being meaningfully
> decodable into characters of a single character set encoding. This is
> because our regex technology isn't expected to be able to shift encodings
> mid-match/scan, nor is it expected to be able to jump over/around things
> that aren't textual, so as not to misinterpret them as code units of
> characters.
>
> So saying a schema component is scannable means the schema expresses the
> assumption that the data is expected to be all textual/characters in a
> uniform encoding, so that it is meaningful to talk about scanning it with
> the kind of regex technology we have available today. That is, there is no
> need to consider encoding changes mid-match, nor what data to jump
> over/around.
>
> This is the sense of 'scan'-able in which we use the term. The expectation
> expressed in the schema is that, so long as the data decodes without
> error, the scan will either succeed or fail normally.
>
> Actual data is scannable if it meets this expectation.
>
> Contrast this with what happens for an assert with testKind='pattern'. In
> that case we scan, but DFDL processors aren't required to provide any
> guarantee that the data is scannable, which means the designer of that
> regex pattern must really know the data, and know what they are doing, so
> as to avoid decode errors.
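>
> For concreteness, a sketch of such an assert (element name and pattern are
> made up for illustration):
>
> <xs:element name="message" dfdl:lengthKind="delimited">
>   <xs:annotation>
>     <xs:appinfo source="http://www.ogf.org/dfdl/">
>       <!-- scans forward from the element's start; no scannability
>            guarantee is provided by the processor -->
>       <dfdl:assert testKind="pattern" testPattern="MSG[0-9]{4}"/>
>     </xs:appinfo>
>   </xs:annotation>
>   <xs:complexType>
>     <xs:sequence>
>       <!-- the fields that the pattern peeks at go here -->
>     </xs:sequence>
>   </xs:complexType>
> </xs:element>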
>
> Why do we do this? Because we don't want to have to refactor a complex
> type element into two sequences, one holding the stuff at the front that
> an assert with a test pattern examines, and the other holding the stuff
> after it. In some situations it may be very awkward or impossible to
> separate out the part at the front that we need to scrutinize. Instead we
> just allow you to put the assert at the beginning, and it is up to the
> regex designer to know that the first few fields the pattern examines are
> scannable (in the sense of the expectation that they are text), or, if not
> scannable by that schema expectation, then at least that their actual
> contents, interpreted as character code points, will not cause decode
> errors. In other words, the writer of a test pattern has to know either
> that the data is scannable (the schema says so), or that the actual data
> will in fact be scannable, in that decode errors won't occur when it is
> decoded.
>
> This also provides a backdoor by which one can use regex technology to
> match against binary data. But then you really have to understand the
> data, so as to know exactly what code units will be found in it and what
> the regex technology will do with them. One can even play games with
> encoding="iso-8859-1", where no decoding errors are possible, to guarantee
> no decode errors when scanning binary data that really shouldn't be
> thought of as text.
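>
> A sketch of that backdoor (hypothetical names; the regex sees the raw
> bytes as iso-8859-1 characters):
>
> <xs:element name="rawRegion" type="xs:string"
>             dfdl:encoding="ISO-8859-1"
>             dfdl:lengthKind="pattern"
>             dfdl:lengthPattern=".*?MSGEND"/>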
>
> The part of the data that a test pattern of an assert actually looks at
> needs to be scannable by schema expectation, or scannable in actual data
> contents.
>
> If you write a test pattern expecting to match things in what will turn
> out to be the first and second elements of a complex type, but the
> encoding is different for the second element, then your test pattern isn't
> going to work (you may get false matches, false non-matches, or decode
> errors), because the assumption that the data is scannable for that first
> and second element is violated.
>
>  Summary: can we just say
>
> scan - verb
> scannable - a schema component is scannable if ... current definition.
> Data is scannable with respect to a specific character set encoding if,
> when it is interpreted as a sequence of code units, they always decode
> without error.
>
> There are two ways to make "any data" scannable for certain. One: set
> encodingErrorPolicy to 'replace'. Now all data will decode without error,
> because you get substitution characters instead.
>
> Two: change the encoding to iso-8859-1 or another supported encoding where
> every byte is a legal code unit.
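>
> Sketches of both (element names are hypothetical):
>
> <!-- One: undecodable bytes become unicode replacement characters -->
> <xs:element name="anyData1" type="xs:string"
>             dfdl:encoding="US-ASCII"
>             dfdl:encodingErrorPolicy="replace"
>             dfdl:lengthKind="delimited"/>
>
> <!-- Two: an encoding in which every byte is a legal code unit -->
> <xs:element name="anyData2" type="xs:string"
>             dfdl:encoding="ISO-8859-1"
>             dfdl:lengthKind="delimited"/>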
>
> If you are in neither situation One nor Two above, then you can only say
> the data is scannable with respect to a specific encoding. Saying the data
> is scannable as ASCII means that it will not contain any bytes that,
> interpreted as ASCII code units, would be illegal. For this example, that
> means all the bytes have values from 0 to 127 (high bit not set). If you
> know that will be true of your data, then even if it is binary data you
> can say it is scannable as ASCII.
>
> The representations of packed-decimal numbers are not scannable as ascii,
> for example, because any high-nibble digit of 8 or greater causes the byte
> containing it to have a value of 128 or higher, for which there is no
> corresponding ascii code unit.
>
> Packed-decimal numbers are also not scannable as UTF-8, because many
> packed-decimal bytes are not legal UTF-8 code unit values, nor will they
> appear in the ordered arrangements UTF-8 requires.
>
> --
> Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
> www.tresys.com
>
>
>
>
>



-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com

