[DFDL-WG] Remaining 037 review issues for today WG call (20 Jan)

Mike Beckerle mbeckerle.dfdl at gmail.com
Thu Jan 28 18:15:32 CST 2010


A few comments in-line below

On Wed, Jan 20, 2010 at 7:01 AM, Alan Powell <alan_powell at uk.ibm.com> wrote:

>
> I have answered most of the issues and comments raised by Steve and Mike
> but some need further discussion.
>
>
> *Issues from Steve H*
>
> *General*. Although dfdl:encoding enums are case insensitive, we should
> stick to UC throughout in examples.
>
> *2*. I agree with the existing comment that the RFC2119 key words should
> be upper case.
>
> *14.3.4.* There are type/rep combinations where lengthKind="implicit" is
> not allowed - so saying that 'pattern' is replaced by 'implicit' on
> unparsing does not work.
> TBD
>

We covered this on the most recent wg call.


>
> *16.2*. I'm not sure that scannability in this constant encoding sense is
> necessary for patterns. I can create a regular expression that extracts all
> characters up to hex value xXX or all characters up to xYY, thereby treating
> the content as an encoding-insensitive black box.


If your byte pattern happens to be a legal part of a multi-byte character
sequence, then you'll get a false recognition, or you won't get what you
expect.

Example: You are searching for byte 0xAA, but that can legally appear as
byte 3 of a 3-byte UTF-8 encoded character. When you say you are looking for
hex AA in a string, DFDL is currently defined to mean you are looking for
the character represented by that raw byte. If the encoding is UTF-8, a lone
0xAA isn't even a legal character encoding sequence, so the decoder should
raise an error or something.
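
To make the false match concrete, here is a minimal Python sketch. It is
only an illustration, not anything from the spec: it assumes UTF-8 data and
picks U+20AA (whose UTF-8 encoding is E2 82 AA) as the 3-byte character. The
sample string and variable names are made up.

```python
# Illustration only: a raw-byte scan for 0xAA inside UTF-8 text.
text = "price: 5\u20aa;"          # U+20AA encodes as E2 82 AA in UTF-8
data = text.encode("utf-8")

# Scanning the bytes finds 0xAA, but only as the last byte of U+20AA.
byte_hit = data.find(b"\xaa")

# Scanning the decoded characters for U+00AA finds nothing.
char_hit = text.find("\u00aa")

# A lone 0xAA is not a legal UTF-8 sequence, so a strict decoder rejects it.
try:
    b"\xaa".decode("utf-8")
    lone_byte_is_legal = True
except UnicodeDecodeError:
    lone_byte_is_legal = False

print(byte_hit, char_hit, lone_byte_is_legal)
```

The byte scan reports a hit in the middle of a character, while the
character scan reports none, which is the mismatch described above.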

Even for a fixed-length single-byte character set, there must be no unused
codes, i.e., codes that have no mapping to ISO 10646, because our infoset is
defined in terms of translations into that.
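
A small Python sketch of the unused-code problem, again only as an
illustration. It relies on Python's own "cp1252" and "latin-1" codec tables
(an assumption about one implementation, not about DFDL), and the helper
name maps_to_unicode is made up.

```python
# Illustration only: find single bytes that a codec cannot map to a character.
def maps_to_unicode(byte_value, encoding):
    """Return True if the single byte decodes to a character in this encoding."""
    try:
        bytes([byte_value]).decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# cp1252 leaves some byte values unassigned; latin-1 assigns all 256.
unmapped_cp1252 = [b for b in range(256) if not maps_to_unicode(b, "cp1252")]
unmapped_latin1 = [b for b in range(256) if not maps_to_unicode(b, "latin-1")]

print([hex(b) for b in unmapped_cp1252], unmapped_latin1)
```

Byte 0x81 is among the unmapped cp1252 values, so a scan for "the character
0x81" in cp1252 data has no character to translate into.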

I think we need encoding="none" or encoding="bytes" or something if you
really want to scan bytes without encoding causing problems.


>
>
> *Issues from Mike B*
>
> ·         Tracker issue: codepoints outside BMP, as literals and in data.
>
> ·         If I put in a value that requires use of a high/low surrogate
> pair, is that an error, does it require me to put in two separate %#...;
> thingys, one for each of the surrogates (in which case these are not really
> code points in ISO10646). If I put in a codepoint for one of the
> supplemental characters and the schema itself is written in UTF-16 then that
> has to translate into literal surrogate pair. Ok, but I’m very uncertain
> about all this stuff
>
The above item had two issues glommed together. There really are two separate
issues. The above is about these crazy codepoints that use surrogate pairs.
That's a minor corner case given the amount of use those get.
The bigger issue is the one below, which is about strings that contain
broken character encodings, where we still need to be able to process the
data. There's also the matter of recovery from errors in decoding, and what
we put out when the infoset contains a character code for which there is no
valid encoding, or just a character code which isn't even in ISO 10646
(e.g., character code 0xFFFFFFFF, which is not a valid character at all).
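
For reference, a minimal Python sketch of the two cases on the output side:
a supplementary code point that turns into a surrogate pair in UTF-16, and
values that have no valid encoding at all. The helper names utf16_code_units
and is_encodable are hypothetical, and the sample code points are arbitrary.

```python
# Illustration only: code points vs. what an encoder can actually emit.
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as ints) for one Unicode code point."""
    raw = chr(code_point).encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

def is_encodable(code_point):
    """True if the value is in the Unicode range and is not a surrogate."""
    in_range = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_range and not is_surrogate

bmp = utf16_code_units(0x20AC)             # inside the BMP: one code unit
supplementary = utf16_code_units(0x1D11E)  # outside the BMP: surrogate pair

print([hex(u) for u in bmp], [hex(u) for u in supplementary])
print(is_encodable(0x1D11E), is_encodable(0xD800), is_encodable(0xFFFFFFFF))
```

So one code point written as a literal may have to become two UTF-16 code
units, while a lone surrogate or 0xFFFFFFFF cannot be encoded at all.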

> Tracker Issue: illegal character encodings for parsing and unparsing. TBD:
> how do these make it into the infoset or are they replaced, and if so how
> TBD: can one represent these in the infoset for output? Ideally not, but…
>
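
As a point of comparison only, Python's decoders offer three policies for an
illegal byte sequence. This sketch shows what each one would put into an
in-memory string; it is not a proposal for what the DFDL infoset should do,
and the sample bytes are made up.

```python
# Illustration only: three ways a decoder can treat an illegal byte sequence.
bad = b"ab\xffcd"                  # 0xFF can never appear in UTF-8

# 1. Strict: decoding fails outright.
try:
    bad.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 2. Replace: the bad byte becomes U+FFFD, so the original byte is lost.
replaced = bad.decode("utf-8", errors="replace")

# 3. Escape: the bad byte is carried through as a lone surrogate (U+DCFF),
#    which lets the exact bytes be written back out.
escaped = bad.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")

print(strict_ok, ascii(replaced), ascii(escaped), round_trip == bad)
```

Only the third policy preserves the data for unparsing, and it does so by
putting something that is not a legal character into the string.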


> ·         Tracker Issue: Processing-time Schema Definition Errors
>
> This section (2.3.1 in this draft), is problematic as we’re trying to allow
> simple DFDL implementations to not do a bunch of static checking, yet if
> implementations differ on when Schema Definition errors are detected, then
> the second paragraph says they are converted to processing errors. This lets
> different implementations do very different things in terms of how the
> speculative parsing back-tracks around.
>
> Grammar ambiguity is a very tricky case. Unless a DFDL implementation can
> prove a grammar to be unambiguous, then it is very hard to say that any
> particular combination of delimiters makes up a legal DFDL schema definition.
> If the parser simply fails because the grammar was ambiguous, there’s no way
> to tell the difference between this and just broken data without proving the
> grammar is unambiguous. In general it is formally undecidable whether a
> grammar is ambiguous or unambiguous.
> (http://books.google.com/books?id=lIuu53IcKWoC&pg=PT217&lpg=PT217&dq=proving+a+grammar+is+unambiguous&source=bl&ots=wie8TAt-MT&sig=ZSD7tIwnXZIT8Ic91BWMH2H2dKg&hl=en&ei=hAQ5S5vPOIri7APc37CKBg&sa=X&oi=book_result&ct=result&resnum=10&ved=0CDAQ6AEwCQ#v=onepage&q=proving%20a%20grammar%20is%20unambiguous&f=false)
>
>
> Since DFDL v1.0 doesn’t allow recursive declarations/definitions, it may be
> possible to prove the ambiguity or unambiguity of a DFDL schema (or
> rather, the data syntax grammar described by it – if you want to bother to
> distinguish the two), but recursion isn’t something we want to rule out for
> the future, so
>
> Type checking is decidable in DFDL’s expression language, so we could
> always detect type safety before run time; however, if we allow a simplistic
> DFDL implementation to just check types at run time then this would, by the
> definition in this section (2.3.1), issue processing errors when it detects
> these at run time, thereby allowing backtracking of the speculative parser
> to be driven off of type-checks in the expression language.  It seems to me
> that we need to find a way to put this problem back into the hands of the
> user, and say that a schema where this actually matters (one where a type
> error causes a backtrack, which ultimately causes a successful parse) is
> illegal, but implementations are allowed to not detect this particular
> illegality.
>
> It seems to me we need to put this problem back into the hands of the user.
>
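
A minimal Python sketch of the divergence described above, under made-up
assumptions: the two exception classes, the branch functions, and the
type_errors_are_processing_errors switch are all hypothetical names used for
illustration, not anything defined by the DFDL spec.

```python
# Illustration only: the same schema and data, two implementations.
class ProcessingError(Exception):
    """Recoverable: a speculative parser may backtrack past it."""

class SchemaDefinitionError(Exception):
    """Fatal: the schema itself is wrong, so parsing stops."""

def branch_with_type_error(data):
    # Hypothetical expression that adds 1 to a string -> TypeError at run time.
    return {"total": data["count"] + 1}

def branch_ok(data):
    return {"label": data["count"]}

def parse_choice(data, branches, type_errors_are_processing_errors):
    """Try each branch of a choice in order, backtracking on recoverable errors."""
    for branch in branches:
        try:
            return branch(data)
        except TypeError as exc:
            if not type_errors_are_processing_errors:
                raise SchemaDefinitionError(str(exc)) from exc
            # Treated as a processing error: backtrack to the next branch.
            continue
    raise ProcessingError("no branch of the choice matched")

data = {"count": "7"}
branches = [branch_with_type_error, branch_ok]

# Implementation that only checks types at run time: backtracks and succeeds.
lenient_result = parse_choice(data, branches, True)

# Implementation that treats the type error as a schema definition error.
try:
    parse_choice(data, branches, False)
    strict_result = "parsed"
except SchemaDefinitionError:
    strict_result = "schema definition error"

print(lenient_result, strict_result)
```

The first implementation returns an infoset and the second rejects the
schema, which is exactly the case where a type error drives a backtrack
into a successful parse.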
> ·         Tracker Issue: "round trip" for infoset. Should we omit the
> whole point?
>
> ·         Tracker Issue: [schema] is an absolute or relative SCD. Why
> bother allowing absolute?
>
> ·         Tracker Issue: Glossary as the place for centralized
> definitions, or should they be repeated there, but also introduced at point
> of first use, or should we put the definitions only at the places where they
> are discussed, and xref from the glossary?
>
> ·         TBD: Issue - semantics of expressions containing relative paths
> that are inherited via ref to a dfdl:defineFormat. (also section 10.3)
>
> ·         TBD: Issue - XPath term - we are not consistent about using the
> term XPath, or "expression" when referring to our expression language. I
> prefer to call it our expression language, and then in the section that
> defines it state that it is a strict subset of XPath 2.0.
>
> ·         TBD: Issue - fn:position is unclear given that we've just said
> we don't support sequences in the expression language.
>
> ·         TBD: Issue - order of sections. Scoping rules section should
> come before variables section, which uses these concepts.
>
>    - TBD: Issue: Case sensitivity of enum names - did we say whether this
>    is case sensitive or not? I believe it should be case sensitive.
>
>
> ·          Issue: dfdl:representation - Strings in binary rep. I see no
> reason why elements of type xs:string should examine dfdl:representation.
> They shouldn't care what it is; they are always "text". I should be able to
> specify a bunch of inter-mixed binary number and string elements without
> having to specify dfdl:representation="text" just to avoid an error on the
> string type elements. I believe xs:string type ignores dfdl:representation
> (always behaves as if dfdl:representation is 'text'). (If we change this then
> the property precedence section for simpletypes changes slightly as
> representation="text" is implied if type is string.)
> That will make it impossible to introduce a binary representation of text
> later.
>

What is "a binary representation of text"? Is there a real issue here? This
is primarily a convenience and clarity issue for me. I do not want to have to
change to representation="text" for every string inside a COBOL structure,
which is ultimately a binary representation object. To me type="string" is
enough. I want to put representation="binary" at the file scope level of the
schema, and then decorate the elements with the specifics of their types,
but I do not expect to have to put representation="text" on anything.

I do not understand what you are trying to achieve by requiring
representation="text" for things that are already textual based on the
type.

The rest of the issues below I think we need to discuss on calls.


>    - textStringPadCharacter textNumberPadCharacter - did we agree that
>    this character must be a "minimum width" character if the char set encoding
>    is variable width? (i.e., the pad char must be 1 byte if the encoding is
>    UTF-8.)
>
>
>
>    - numberInfinityRep numberNanRep - Is this applicable only to xs:double
>    and xs:float? Also, what I've seen requires a distinction of sign. I.e.,
>    there are positive and negative infinities often printing as -inf and +inf.
>
>    ·         TBD: Issue - \n in regular expressions - clarify relationship
>    of this to entities like NL entity. Also, if I include an entity like WSP*
>    in a regular expression (can I?) does it then match accordingly?
>
>    It appears that some of our multi-valued entities like WSP+ create
>    conditional "matching" behavior without having to use regular expressions,
>    e.g., when WSP+ is used as a separator. But can you use entities like WSP+
>    in a regular expression? It seems you should be able to use regular "single
>    valued" entities in a regular expression; it's these multi-valued ones that
>    have tricky semantics.
>    Added Unicode values to \n, \t, \r.  Disallow DFDL entities in regular
>    expressions.
>    - 14.1 Alignment - TBD: Issue - zero-based thinking here. But all the
>    bits stuff and everything else in DFDL uses 1-based reasoning. Need to
>    revisit to make this sensible for 1 based world.
>    Added implicit alignment table. TBD zero-based
>
>
>
>    - finalTerminatorCanBeMissing - spec is not clear. Also is there a
>    finalSeparatorCanBeMissing?
>    Changed to finalDocumentTerminatorCanBeMissing and finalDocumentSeparatorCanBeMissing.
>    Not sure where finalDocumentSeparatorCanBeMissing should be specified.
>    Looks odd on 'distinguished root'. These properties operate differently from
>    other properties as they are defined on the 'distinguished root' but affect
>    some lower down element. Effectively they are put in scope by a different
>    mechanism
>
>
>
> Alan Powell
>
> MP 211, IBM UK Labs, Hursley,  Winchester, SO21 2JN, England
> Notes Id: Alan Powell/UK/IBM     email: alan_powell at uk.ibm.com
> Tel: +44 (0)1962 815073                  Fax: +44 (0)1962 816898
>