[DFDL-WG] Remaining 037 review issues for today WG call (20 Jan)
Alan Powell
alan_powell at uk.ibm.com
Wed Jan 20 06:01:38 CST 2010
I have answered most of the issues and comments raised by Steve and Mike
but some need further discussion.
Issues from Steve H
General. Although dfdl:encoding enums are case-insensitive, we should
stick to upper case throughout the examples.
2. I agree with the existing comment that the RFC2119 key words should be
upper case.
14.3.4. There are type/rep combinations where lengthKind="implicit" is not
allowed - so saying that 'pattern' is replaced by 'implicit' on unparsing
does not work.
TBD
16.2. I'm not sure that scannability in this constant-encoding sense is
necessary for patterns. I can create a regular expression that extracts
all characters up to hex value xXX or all characters up to xYY, thereby
treating the content as an encoding-insensitive black box.
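The point in 16.2 can be illustrated with a byte-level regular expression.
This is a minimal Python sketch, not DFDL: the function name scan_up_to and
the sample bytes are made up for illustration, and it assumes the pattern
is applied to raw bytes rather than to decoded characters.

```python
import re

def scan_up_to(data: bytes, stop_byte: int) -> bytes:
    """Return the leading bytes of data up to (not including) stop_byte.

    The content is treated as an opaque run of bytes, so no character
    encoding is assumed for it.
    """
    # Build a negated byte class such as [^\x1f]; re.escape keeps the
    # byte literal even when it happens to be a regex metacharacter.
    pattern = re.compile(b"[^" + re.escape(bytes([stop_byte])) + b"]*")
    return pattern.match(data).group(0)

# Hypothetical data: two non-ASCII bytes, then a 0x1F delimiter.
print(scan_up_to(b"\xc1\xc2\x1fREST", 0x1F))
```

Whether the draft should permit this kind of byte-oriented pattern is the
open question; the sketch only shows that it is expressible.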
Issues from Mike B
· Tracker issue: codepoints outside BMP, as literals and in data.
· If I put in a value that requires use of a high/low surrogate
pair, is that an error, or does it require me to put in two separate
%#...; thingys, one for each of the surrogates (in which case these are
not really code points in ISO10646)? If I put in a codepoint for one of
the supplemental characters and the schema itself is written in UTF-16
then that has to translate into a literal surrogate pair. OK, but I'm very
uncertain about all this stuff.
· Tracker Issue: illegal character encodings for parsing and
unparsing. TBD: how do these make it into the infoset, or are they
replaced, and if so how? TBD: can one represent these in the infoset for
output? Ideally not, but?
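To make the surrogate-pair question concrete, here is a small Python
sketch of how one code point outside the BMP looks in UTF-16 and UTF-8.
The choice of U+1D11E is arbitrary, and nothing here reflects how a DFDL
processor is required to treat such a literal.

```python
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane.
code_point = 0x1D11E
char = chr(code_point)

utf16 = char.encode("utf-16-be")   # 4 bytes: one surrogate pair
utf8 = char.encode("utf-8")        # 4 bytes: no surrogates involved

# Split the UTF-16 form into its two 16-bit code units.
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")

print(f"U+{code_point:X} -> UTF-16 units {high:#06x} {low:#06x}, "
      f"UTF-8 bytes {utf8.hex()}")
```

The single code point becomes two 16-bit units (0xD834, 0xDD1E) in UTF-16,
which is why a literal written as one code point and a literal written as
two surrogate values are not obviously the same thing.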
· Tracker Issue: Processing-time Schema Definition Errors
This section (2.3.1 in this draft) is problematic. We're trying to allow
simple DFDL implementations to skip a bunch of static checking, yet the
second paragraph says that Schema Definition errors detected at
processing time are converted to processing errors. Since implementations
differ on when they detect such errors, this lets different
implementations do very different things in how speculative parsing
backtracks.
Grammar ambiguity is a very tricky case. Unless a DFDL implementation can
prove a grammar to be unambiguous, it is very hard to say that any
particular combination of delimiters makes up a legal DFDL schema
definition. If the parser simply fails because the grammar was ambiguous,
there's no way to tell the difference between this and just broken data
without proving the grammar is unambiguous. In general it is formally
undecidable whether a grammar is ambiguous or unambiguous. (
http://books.google.com/books?id=lIuu53IcKWoC&pg=PT217&lpg=PT217&dq=proving+a+grammar+is+unambiguous&source=bl&ots=wie8TAt-MT&sig=ZSD7tIwnXZIT8Ic91BWMH2H2dKg&hl=en&ei=hAQ5S5vPOIri7APc37CKBg&sa=X&oi=book_result&ct=result&resnum=10&ved=0CDAQ6AEwCQ#v=onepage&q=proving%20a%20grammar%20is%20unambiguous&f=false
)
Since DFDL v1.0 doesn't allow recursive declarations/definitions, it may
be possible to prove the ambiguity or unambiguity of a DFDL schema (or
rather, the data syntax grammar described by it, if you want to bother to
distinguish the two), but recursion isn't something we want to rule out
for the future, so
Type checking is decidable in DFDL's expression language, so we could
always check type safety before run time. However, if we allow a
simplistic DFDL implementation to just check types at run time, then by
the definition in this section (2.3.1) it would issue processing errors
when it detects type errors at run time, thereby allowing backtracking of
the speculative parser to be driven off type checks in the expression
language. It seems to me that we need to find a way to put this problem
back into the hands of the user, and say that a schema where this actually
matters (one where a type error causes a backtrack, which ultimately
causes a successful parse) is illegal, but implementations are allowed to
not detect this particular illegality.
It seems to me we need to put this problem back into the hands of the
user.
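The concern about type errors driving backtracking can be shown with a toy
model. This Python sketch is not a DFDL parser: parse_choice, branch_a,
branch_b and the check_types_statically flag are all hypothetical names,
and Python's TypeError stands in for a type error in the expression
language.

```python
class ProcessingError(Exception):
    """Recoverable: a speculative parser backtracks and tries the next branch."""

class SchemaDefinitionError(Exception):
    """Fatal: the schema itself is wrong, so parsing stops."""

def branch_a(data):
    # Hypothetical branch whose expression has a type error:
    # it adds a number to a string.
    return {"a": data["len"] + 1}

def branch_b(data):
    return {"b": data["len"]}

def parse_choice(data, branches, check_types_statically):
    """Toy model of a choice: try each branch in order."""
    for branch in branches:
        try:
            return branch(data)
        except TypeError as err:
            if check_types_statically:
                # Models an implementation that reports the type error as a
                # schema definition error (detected lazily here for brevity).
                raise SchemaDefinitionError(str(err)) from err
            # Models a simple implementation: the type error surfaces at
            # run time as a processing error, so the parser backtracks.
            continue
        except ProcessingError:
            continue
    raise ProcessingError("no branch matched")

# The simple implementation silently "succeeds" via the second branch.
print(parse_choice({"len": "3"}, [branch_a, branch_b], False))
```

With the flag set to True the same schema and data raise
SchemaDefinitionError instead, which is the divergence between
implementations described above.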
· Tracker Issue: "round trip" for infoset. Should we omit the
whole point?
· Tracker Issue: [schema] is an absolute or relative SCD. Why
bother allowing absolute?
· Tracker Issue: Glossary as the place for centralized
definitions, or should they be repeated there, but also introduced at
point of first use, or should we put the definitions only at the places
where they are discussed, and xref from the glossary?
· TBD: Issue - semantics of expressions containing relative paths
that are inherited via ref to a dfdl:defineFormat. (also section 10.3)
· TBD: Issue - XPath term - we are not consistent about using the
term XPath, or "expression" when referring to our expression language. I
prefer to call it our expression language, and then in the section that
defines it state that it is a strict subset of XPath 2.0.
· TBD: Issue - fn:position is unclear given that we've just said
we don't support sequences in the expression language.
· TBD: Issue - order of sections. Scoping rules section should
come before variables section, which uses these concepts.
· TBD: Issue - case sensitivity of enum names - did we say whether this is
case sensitive or not? I believe it should be case sensitive.
· Issue: dfdl:representation - Strings in binary rep. I see no
reason why elements of type xs:string should examine dfdl:representation.
They shouldn't care what it is; they are always "text". I should be able
to specify a bunch of intermixed binary number and string elements
without having to specify dfdl:representation="text" just to avoid an
error on the string type elements. I believe xs:string type ignores
dfdl:representation (always behaves as if dfdl:representation is
'text'). (If we change this then the property precedence section for
simple types changes slightly, as representation="text" is implied if type
is string.)
That will make it impossible to introduce a binary representation of text
later.
textStringPadCharacter textNumberPadCharacter - did we agree that this
character must be a "minimum width" character if the char set encoding is
variable width? (I.e., the pad char must be 1 byte if the encoding is
UTF-8.)
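A quick way to see why pad character width matters in a variable-width
encoding is to measure the encoded size of a few candidates. This is a
Python sketch; the helper name encoded_width and the list of candidate
characters are illustrative only.

```python
def encoded_width(char: str, encoding: str) -> int:
    """Number of bytes one character occupies in the given encoding."""
    return len(char.encode(encoding))

# Space, digit zero, no-break space, ideographic space.
candidates = [" ", "0", "\u00a0", "\u3000"]
for ch in candidates:
    print(f"U+{ord(ch):04X}: {encoded_width(ch, 'utf-8')} byte(s) in UTF-8")
```

Only the first two are 1 byte in UTF-8. A 3-byte pad character such as
U+3000 cannot fill, say, a 10-byte field exactly, which is presumably the
motivation for restricting pad characters to the minimum width.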
numberInfinityRep numberNanRep - Is this applicable only to xs:double and
xs:float? Also, what I've seen requires a distinction of sign. I.e., there
are positive and negative infinities often printing as -inf and +inf.
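As one data point on the sign question, this Python sketch prints the
text forms Python itself uses for floating-point infinities and NaN.
Other systems print them differently, so this says nothing about what
numberInfinityRep or numberNanRep should accept.

```python
pos_inf = float("inf")
neg_inf = float("-inf")
nan = float("nan")

# Default text forms, plus the explicitly signed form of +infinity.
texts = [str(pos_inf), str(neg_inf), str(nan), format(pos_inf, "+")]
print(texts)

# The two infinities are distinct values, so a single unsigned
# representation cannot round-trip both of them.
print(pos_inf == neg_inf)
```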
· TBD: Issue - \n in regular expressions - clarify relationship of
this to entities like NL entity. Also, if I include an entity like WSP* in
a regular expression (can I?) does it then match accordingly?
It appears that some of our multi-valued entities like WSP+ create
conditional "matching" behavior without having to use regular expressions,
e.g., when WSP+ is used as a separator. But can you use entities like WSP+
in a regular expression? It seems you should be able to use regular
"single valued" entities in a regular expression; it's these multi-valued
ones that have tricky semantics.
Added Unicode values to \n, \t, \r. Disallow DFDL entities in regular
expressions.
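The difference between \n in a regular expression and a multi-valued
newline entity can be sketched in Python. The set of alternatives used for
nl_pattern below is an assumption chosen for illustration, not the
definition of the NL entity in the draft.

```python
import re

# \n in a regular expression is the single character U+000A.
plain_newline = re.compile(r"\n")

# Assumed expansion of a multi-valued newline entity: several line-ending
# sequences, with the two-character CRLF tried first.
nl_pattern = re.compile(r"\r\n|\r|\n|\u0085|\u2028")

samples = ["a\nb", "a\r\nb", "a\u2028b"]
for s in samples:
    plain = plain_newline.search(s)
    multi = nl_pattern.search(s)
    print(repr(s), bool(plain), repr(multi.group(0)) if multi else None)
```

The third sample matches only the multi-valued pattern, and the second
matches different text depending on which pattern is used. Keeping DFDL
entities out of regular expressions avoids having to define that
interaction.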
14.1 Alignment - TBD: Issue - zero-based thinking here. But all the bits
stuff and everything else in DFDL uses 1-based reasoning. Need to revisit
to make this sensible for a 1-based world.
Added implicit alignment table. TBD zero-based
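The zero-based versus 1-based difference comes down to an off-by-one in
the alignment arithmetic. This Python sketch uses hypothetical function
names and does not assume which convention the spec will settle on.

```python
def next_aligned_zero_based(pos: int, alignment: int) -> int:
    """Smallest aligned offset >= pos, where offsets start at 0."""
    # Ceiling division, then scale back up to a multiple of alignment.
    return -(-pos // alignment) * alignment

def next_aligned_one_based(pos: int, alignment: int) -> int:
    """Smallest aligned position >= pos, where positions start at 1."""
    # Shift to zero-based, align, then shift back.
    return next_aligned_zero_based(pos - 1, alignment) + 1

# With alignment 4, zero-based aligned offsets are 0, 4, 8, ...
# and 1-based aligned positions are 1, 5, 9, ...
for pos in (1, 2, 5, 6):
    print(pos, next_aligned_zero_based(pos, 4), next_aligned_one_based(pos, 4))
```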
finalTerminatorCanBeMissing - spec is not clear. Also, is there a
finalSeparatorCanBeMissing?
Changed to finalDocumentTerminatorCanBeMissing and
finalDocumentSeparatorCanBeMissing. Not sure where
finalDocumentSeparatorCanBeMissing should be specified. Looks odd on
'distinguished root'. These properties operate differently from other
properties as they are defined on the 'distinguished root' but affect some
lower down element. Effectively they are put in scope by a different
mechanism.
Alan Powell
MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM email: alan_powell at uk.ibm.com
Tel: +44 (0)1962 815073 Fax: +44 (0)1962 816898