[DFDL-WG] clarification needed for scan and scannable
Steve Hanson
smh at uk.ibm.com
Wed Aug 14 05:45:48 EDT 2013
Action 217 raised to decide new terminology for the regex scanning
encoding requirement.
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: Tim Kimber/UK/IBM at IBMGB,
Cc: Steve Hanson/UK/IBM at IBMGB
Date: 30/07/2013 01:01
Subject: Re: clarification needed for scan and scannable
I reviewed the spec. I have to reverse my prior statement. What you
describe is definitely currently allowed by the spec. The 'scannable'
restriction is reserved for lengthKind='pattern'.
In my view this is a terrible idea, but it is where it is: lengthKind
'delimited' does not require the 'scannable' characteristic throughout
everything contained within it. This will not let us really help users
identify what are very likely to be mistakes in their schemas, because what
we call 'delimited' is far too permissive.
I am not going to argue to change this at the current time. This is just
due to the need to converge on the standard. My preference would be that
lengthKind 'delimited' requires scannability (as uniform text), and that
some new keyword be used to mean the current algorithm as specified.
I do suggest that we rename the characteristic scannability to
scannable-as-text, in order to make it clearer what the requirement is,
and clarify that this is a requirement on the schema component.
I suspect that lengthKind='delimited' will perhaps someday (DFDL v2.0?) be
deprecated and/or augmented by other keywords that are more specific such
as "delimitedText" which means scanning for delimiters over only text in
one encoding (the 'scannable' restriction), and other keywords like
"delimitedBinary" or "delimitedMixed" meaning formats which admit the more
powerful and complex things.
My argument is simple: if people have something like delimiters in
distinct encodings appearing in their schema, it is most likely due to
human error (they missed one place when they were editing the encoding
property), rather than something as obscure as
delimiters-in-a-different-encoding being a real characteristic of the
data. An SDE here will help them find this error.
Furthermore, if you truly want a string to be terminated by either an
ebcdic comma (6B) or an ascii linefeed (0A), then you have a few
alternatives, all consistent with scannability, before you have to resort
to the now-optional 'rawbyte' feature.
First, use encoding ascii, and specify terminators of %LF; and 'k' (or
%#x6B;, since 'k' is x6B in ascii), in which case the string is assumed to
contain only ascii code units. In that case, if ebcdic 'a' (x81, which is
illegal in ascii) is encountered in the data, you will get either an error
or a Unicode replacement character, depending on encodingErrorPolicy. If
you know your data will contain only ascii-legal code units, then this is
a good solution.
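The code-point arithmetic above can be spot-checked in Python (a hedged sketch, not DFDL itself; the errors='strict'/'replace' arguments stand in for encodingErrorPolicy 'error'/'replace'):

```python
# 'k' is x6B in ASCII, so an ASCII terminator of 'k' matches the same
# byte as an EBCDIC comma.
assert 'k'.encode('ascii') == b'\x6b'
assert '\n'.encode('ascii') == b'\x0a'   # the %LF; terminator

data = b'\x81'   # EBCDIC 'a' -- illegal as an ASCII code unit
try:
    data.decode('ascii')                 # like encodingErrorPolicy 'error'
    raise AssertionError('expected a decode error')
except UnicodeDecodeError:
    pass

# Like encodingErrorPolicy 'replace': the Unicode replacement character.
assert data.decode('ascii', errors='replace') == '\ufffd'
```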
I would note that your example (previously in the thread) did not specify
an encoding for the innermost string elements. Did you intend for those to
be ebcdic or ascii?
Alternatively, you can specify encoding ebcdic-cp-us, with terminators of
comma (code point x6B) and %#x0A;, which is the RPT control code in ebcdic
but not an illegal character. In that case the string can contain any
legal ebcdic code point. However, if code unit x70 is encountered (it
corresponds to ascii 'p', but is unmapped in ebcdic-cp-us), you will get
either an error or a Unicode replacement character, depending on
encodingErrorPolicy. If you know your data will contain only legal ebcdic
code units, then this is a good solution.
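The ebcdic arithmetic here can also be checked in Python (a sketch: cp037 is Python's closest stand-in for ebcdic-cp-us, and note that, unlike the table described above, Python's cp037 codec maps all 256 bytes, so it will not reproduce the x70 decode error):

```python
# The EBCDIC comma is code point x6B -- the same byte as ASCII 'k',
# which is why the two terminator schemes in this message line up.
assert ','.encode('cp037') == b'\x6b'

# x0A is a control code in EBCDIC, but it is a legal code point,
# so a %#x0A; terminator can be matched without a decode error.
assert len(b'\x0a'.decode('cp037')) == 1
```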
Finally, you can specify encoding iso-8859-1, with terminators %#x6B; (or
'k', which is x6B in iso-8859-1) and %#x0A; (linefeed in iso-8859-1). Then
any code units at all are legal, because every byte has a corresponding
character code point in iso-8859-1. If you have no idea what the data is,
but just want some sort of string out of it, and know only that the
terminators are these bytes, then this is a good solution. If your data
contains, for example, packed-decimal numbers, then this is the only way
to safely scan past them as a string, because ebcdic and ascii both leave
unmapped some of the code units that can appear in packed-decimal bytes.
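The iso-8859-1 claim is easy to verify in Python (iso-8859-1 is also known as latin-1):

```python
# Every byte 0x00-0xFF maps to a character in ISO-8859-1, so arbitrary
# binary data always decodes without error.
all_bytes = bytes(range(256))
text = all_bytes.decode('iso-8859-1')
assert len(text) == 256

# The two terminators from the example:
assert 'k'.encode('iso-8859-1') == b'\x6b'
assert '\n'.encode('iso-8859-1') == b'\x0a'

# The mapping round-trips losslessly, so no data is disturbed.
assert text.encode('iso-8859-1') == all_bytes
```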
All the above are consistent with implementations that attempt to convert
data from code-units (bytes) to Unicode codepoints using a character set
decoder, and then apply any scanning/searching only in the Unicode realm,
and work in one pass over the data.
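That one-pass decode-then-scan approach can be sketched as follows (a hypothetical helper, not any particular implementation; scan_for_terminator and its parameters are illustrative assumptions, with errors='replace' standing in for encodingErrorPolicy='replace'):

```python
import re

def scan_for_terminator(data, encoding, terminators, policy='error'):
    # One pass: decode the code units to Unicode once, then do all
    # searching in the Unicode realm with ordinary regex machinery.
    errors = 'replace' if policy == 'replace' else 'strict'
    text = data.decode(encoding, errors=errors)
    pattern = '|'.join(re.escape(t) for t in terminators)
    m = re.search(pattern, text)
    if m is None:
        return None                       # scan failed normally: no match
    return text[:m.start()], m.group(0)   # (field content, matched terminator)

# Usage: iso-8859-1 data terminated by x6B ('k') or x0A (linefeed).
assert scan_for_terminator(b'abc\x6bdef', 'iso-8859-1', ['k', '\n']) == ('abc', 'k')
```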
I currently think that implementing both rawbytes and encodingErrorPolicy
will require two passes over the data. I expect this overhead to be
negligible, so I'm not really worried about it.
...mike
On Mon, Jul 29, 2013 at 7:18 AM, Tim Kimber <KIMBERT at uk.ibm.com> wrote:
Hi Mike,
Good - that was almost exactly the reply that I was expecting. I now
understand exactly where you are coming from, and how we arrived at this
position.
First, a few statements that I think are true. I want to establish some
basic ground rules before we decide how to go forward:
a) it is desirable for DFDL ( more accurately, a valid subset of DFDL ) to
be implementable using well known parsing techniques. I think that pretty
much means regular expressions and BNF-style grammars. That implies that
it might be possible to implement a DFDL parser using one of the
well-known parser-generator technologies like Yacc/Bison/JavaCC/Antlr. I'm
not claiming that it *is* possible, but I think it would be a good thing
if it were.
b) It is technically feasible to implement a DFDL parser that can handle
the mixed-encoding example using regex technology. However, it would not
be a very efficient implementation because the scanner would have to scan
the data once per encoding, and it would have to do that for every
character after the end of field1.
c) It is possible to produce an efficient scanner that handles mixed
encodings. Such a scanner cannot use regex technology for scanning - in
fact, I think the only efficient implementation is to convert all
terminating markup into byte sequences and then perform all scanning in
the byte domain. This is what the IBM implementation does.
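A minimal sketch of that byte-domain strategy (illustrative only: scan_bytes is a naive linear search written for clarity, not the IBM implementation):

```python
def scan_bytes(data, markup):
    # markup is a list of (text, encoding) pairs: each piece of
    # terminating markup is pre-encoded in its component's encoding,
    # and the scan then runs entirely in the byte domain -- no
    # character decoding happens during scanning at all.
    byte_delims = [text.encode(enc) for text, enc in markup]
    for i in range(len(data)):
        for d in byte_delims:
            if data[i:i + len(d)] == d:
                return data[:i], d        # (field bytes, matched delimiter)
    return data, None                     # no delimiter found

# The EBCDIC comma (x6B) and ASCII linefeed (x0A) from the example:
markup = [(',', 'cp037'), ('\n', 'ascii')]
assert scan_bytes(b'\xc1\xc2\x6b\xc3', markup) == (b'\xc1\xc2', b'\x6b')
```

Note that the ambiguity Mike raises is visible here: an ascii 'k' in the data is byte-identical to the ebcdic comma x6B, so a byte-domain scan cannot distinguish them.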
The scenario in my example is not entirely far-fetched - it is conceivable
that the encoding might change mid-way through a document, and I think
Steve came up with a real-world format that required this ( I have a hazy
memory of discussing this a couple of years ago ). The requirement for the
raw byte entity is not for this case - it is for the case where the
delimiters are genuinely not characters ( e.g. a UTF-16 data stream
terminated by a single null byte ). However, it is not easy to come up
with realistic examples where the raw byte entity could not be translated
into a character before the processor uses it. I think that's where some
of the confusion has arisen.
We have already agreed to make the raw byte entity an optional feature. We
should consider disallowing mixed encodings when lengthKind is delimited.
If we cannot do that then I agree with Mike that the descriptions and
definitions in the spec need to be made a bit clearer.
regards,
Tim Kimber, DFDL Team,
Hursley, UK
Internet: kimbert at uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: Tim Kimber/UK/IBM at IBMGB,
Cc: Steve Hanson/UK/IBM at IBMGB
Date: 26/07/2013 22:49
Subject: Re: clarification needed for scan and scannable
Clearly we have to rework the description and definitions because my
understanding of the current spec would say your schema is not valid
because it is not scannable. Everything in the scope of the enclosing
element which is delimited must be scannable and that means uniform
encoding.
It is exactly to rule out this mixed encoding ambiguity that we have the
restriction.
I have no idea how to implement delimiters without this restriction.
I think the case you have here is what raw bytes are for. Though I am no
longer clear on how to implement rawbytes either.
On Jul 25, 2013 6:40 PM, "Tim Kimber" <KIMBERT at uk.ibm.com> wrote:
Suppose I have this DFDL XSD:
<xs:element name="documentRoot" dfdl:encoding="US-ASCII"
            dfdl:lengthKind="delimited" dfdl:terminator="]]]">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="delimitedRecord" maxOccurs="unbounded"
                  dfdl:encoding="US-ASCII" dfdl:lengthKind="delimited"
                  dfdl:terminator="%LF;">
        <xs:complexType>
          <xs:sequence dfdl:separator=","
                       dfdl:separatorSuppressionPolicy="suppressedAtEndLax"
                       dfdl:encoding="EBCDIC-US">
            <xs:element name="field1" type="xs:string"
                        dfdl:lengthKind="delimited"/>
            <xs:element name="field2" type="xs:string"
                        dfdl:lengthKind="delimited" minOccurs="0"/>
            <xs:element name="field3" type="xs:string"
                        dfdl:lengthKind="delimited" minOccurs="0"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
...which will parse some data like this:
field1Value,field2Value,field3Value
field1Value,
field1Value,field2Value]]]
...except that the commas that delimit the fields are in EBCDIC and the
linefeeds that delimit the records are in ASCII .
I have purposely constructed this example with separatorSuppressionPolicy
set to 'suppressedAtEndLax'. This means that field1 and field2 could both
be terminated by either of
a) an EBCDIC comma or
b) an ASCII line feed
This is a very artificial example, but it is valid DFDL and should not
produce a schema definition error. If we use the term 'scan' when discussing
lengthKind='delimited' then we need to be careful how to define the term
'scannable' - otherwise we might appear to prohibit things that are
actually valid. I think this is the same point that Steve was making.
Most implementations will do the same as Daffodil and will use a regex
engine to implement all types of scanning, including
lengthKind='delimited'. It's the most natural solution, and it can be made
to work as long as
a) the implementation takes extra care if/when the encoding changes within
a component and
b) the implementation either does not support the raw byte entity, or only
supports it when it can be translated to a valid character in the
component's encoding.
There is huge scope for confusion in this area - most readers of the DFDL
specification will assume that delimited scanning can be implemented using
regex technology plus an EBNF grammar. That may be possible ( I'm not
claiming that it is, btw ), but the grammar would be decidedly
non-trivial, so the DFDL specification should take extra care to be
unambiguous when discussing how to scan for terminating markup.
regards,
Tim Kimber, DFDL Team,
Hursley, UK
Internet: kimbert at uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: Steve Hanson/UK/IBM at IBMGB,
Cc: Tim Kimber/UK/IBM at IBMGB
Date: 25/07/2013 19:04
Subject: clarification needed for scan and scannable
If a user has read the Glossary and noted the definition of 'scannable',
then when he sees the terms 'scanning' and 'scanned' in 12.3.2 he may
think that this implies the data being read by that lengthKind must be
'scannable', and it is not so. That's how I read the spec, hence my
original comment. I find it confusing to define 'scan' and then not define
'scannable' as 'able to be scanned'.
OK, so I am rethinking how we express what we mean by scan.
scan - verb. The action of attempting to find something in characters.
Implies decoding the characters from code units to character codes.
A scan can succeed on real data, or fail (no match, not found), but both
are normal behaviors for a scan.
However, this is predicated on the data at least being meaningfully
decoded into characters of a single character set encoding. This is
because our regex technology isn't expected to be able to shift encodings
mid match/scan, nor is it expected to be able to jump over/around things
that aren't textual so as not to misinterpret them as code units of
characters.
So saying a schema component is scannable - means the schema expresses the
assumption that the data is expected to be all textual/characters in a
uniform encoding so that it is meaningful to talk about scanning it with
the kind of regex technology we have available today. That is, there's no
need to consider encoding changes mid match, nor what data to jump
over/around.
This is the sense of 'scannable' that we intend. Our expectation, as
expressed in the schema, is that so long as the data decodes without
error, the scan will either succeed or fail normally. Actual data is
scannable if it meets this expectation.
Contrast this with what happens for an assert with testKind='pattern'. In
that case we scan, but DFDL processors aren't required to provide any
guarantee that the data is scannable, which means the designer of that
regex pattern must really know the data, and know what they are doing, so
as to avoid decode errors.
Why do we do this? Because we don't want to force users to refactor a
complex type element into two sequences, one containing the stuff at the
front that an assert with a test pattern examines, and the other
containing the stuff after it. In some situations it may be very awkward
or impossible to separate out the part at the front that we need to
scrutinize. Instead we just allow you to put the assert at the beginning,
and it is up to the regex designer to know that the first few fields the
pattern examines are scannable (in the sense that the schema expects them
to be text), or, if they are not scannable in that sense, that at least
their actual contents, when interpreted as character code points, will not
cause decode errors. In other words, the writer of a test pattern has to
know either that the data is scannable (the schema says so), or that the
actual data will in practice be scannable, in that decode errors won't
occur when it is decoded.
This also provides a backdoor by which one can use regex technology to
match against binary data. But then you really have to understand the
data, so that you know exactly what code units will be found in it and
what the regex technology will do with them. One can even play games with
encoding="iso-8859-1", where no decode errors are possible, so as to
safely scan binary data that really shouldn't be thought of as text.
The part of the data that a test pattern of an assert actually looks at
needs to be scannable by schema expectation, or scannable in actual data
contents.
If you write a test pattern, expecting to match things in what will turn
out to be the first and second elements of a complex type, but the
encoding is different for the second element, then your test pattern isn't
going to work (you may get false matches, false non-matches, or decode
errors), because the assumption that the data is scannable over the first
and second elements is violated.
Summary: can we just say
scan - verb
scannable - a schema component is scannable if ... current definition.
Data is scannable with respect to a specific character set encoding if,
when it is interpreted as a sequence of code units, they always decode
without error.
There are two ways to make "any data" scannable for certain. One: set
encodingErrorPolicy to 'replace'. Now all data will decode without error,
because you get substitution characters instead. Two: change the encoding
to iso-8859-1, or another supported encoding in which every byte is a
legal code unit.
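Both remedies can be demonstrated in Python (with errors='replace' standing in for encodingErrorPolicy='replace'):

```python
problem = bytes(range(256))   # includes bytes no 7-bit encoding covers

# One: with 'replace', illegal code units become U+FFFD instead of errors.
replaced = problem.decode('ascii', errors='replace')
assert len(replaced) == 256 and '\ufffd' in replaced

# Two: in iso-8859-1 every byte is a legal code unit, so nothing can fail.
assert len(problem.decode('iso-8859-1')) == 256
```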
If you are in neither situation One nor Two above, then you have to say
the data is scannable with respect to a specific encoding. Saying the data
is scannable as ascii means that it will not contain any bytes that are
illegal when interpreted as ascii code units. For this example, that means
all the bytes have values from 0 to 127 (high bit not set). If you know
that will be true of your data, then even if it is binary data you can say
it is scannable as ascii.
The representation of packed-decimal numbers is not scannable as ascii,
for example, because any high nibble holding a digit 8 or greater gives
the byte containing it a value of 128 or higher, for which there is no
corresponding ascii code unit. Packed-decimal numbers are also not
scannable as UTF-8, because many packed-decimal bytes are not legal UTF-8
code unit values, nor will they occur in the ordered sequences UTF-8
requires.
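A concrete check of the packed-decimal claim (Python sketch; the byte values are an assumed packed encoding of -987):

```python
packed = b'\x98\x7d'   # packed decimal -987: digits 9, 8, 7 plus sign nibble D

for enc in ('ascii', 'utf-8'):
    try:
        packed.decode(enc)
        raise AssertionError(enc + ': expected a decode error')
    except UnicodeDecodeError:
        # x98 has no ASCII code unit (value > 127), and as UTF-8 it is a
        # continuation byte with no lead byte, so both decodes fail.
        pass

# As iso-8859-1 the same bytes decode without error.
assert len(packed.decode('iso-8859-1')) == 2
```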
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
--
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com