[DFDL-WG] Agenda for OGF DFDL WG call 02 February 2010- 13:00 UK (8:00 ET)

Alan Powell alan_powell at uk.ibm.com
Mon Feb 1 11:46:34 CST 2010


1. Discriminators
Review two options (attached) 


2. Remaining 037 review issues 
See below

3. Go through Actions 
 
Current Actions:
No.  Action


045
20/05 AP: Speculative Parsing
27/05: Pseudocode has been circulated. Review for next call
03/06: Comments received and will be incorporated
09/06: Progress but not discussed
17/06: Discussed briefly
24/06: No Progress
01/07: No Progress
15/07: No progress. MB not happy with the way the algorithm is documented, 
need to find a better way.
29/07: No Progress 
05/08: No Progress. Will document behaviour as a set of rules.
12/08: No Progress 
...
16/09: no progress
30/09: AP distributed proposal and others commented. Brief discussion. AP 
to incorporate updates and reissue.
07/10: Updated proposal was discussed. Comments will be incorporated into 
the next version.
14/10: Alan to update proposal to include array scenario where minOccurs > 
0
21/10: Updated proposal reviewed
28/10: Updated proposal reviewed see minutes
04/11: Discussed semantics of discriminators on arrays. MB to produce 
examples
11/11: Absorbing action 033 into 045.  Maybe decorated discriminator kinds 
are needed after all. MB and SF to continue with examples. 
18/11: Went through WTX implementation of example. SF to gather more 
documentation about WTX discriminator rules.
25/11: Further discussion. Will get more WTX documentation. Need to 
confirm that no changes are needed to the Resolving Uncertainty doc.
04/11: Further discussion about arrays.
09/12: Reviewed proposed discriminator semantic.
16/12: Reviewed discriminator examples and WTX semantic.
23/12: SF to provide better description of WTX behaviour and invite B 
Connolley to next call
06/01: B Connolly not available. SF to provide more complete description.
13/01: Stephanie took us through a description of WTX identifiers. Mike 
agreed to write up in DFDL terms.
20/01: Mike will write up
27/01: further discussion of discriminators
29/01: Alan had emailed both proposals but not enough time to discuss
049
20/05 AP Built-in specification description and schemas
03/06: not discussed
24/06: No Progress
24/06: No Progress (hope to get these from test cases)
15/07: No progress. Once available, the examples in the spec should use 
the dfdl:defineFormat annotations they provide.
...
14/10: no progress
21/10: Discussed the real need for this being in the specification. It 
seemed that the main value is that it defines a schema location for 
downloading 'known' defaults from the web. 
28/10: no progress
04/11: no progress
11/11: no update
18/11: no update
25/11: Agreed to try to produce for CSV and fixed formats
04/12: no update
09/12: no update
16/12: no update
23/12: no update
06/01: no progress. If there is no resource to complete this action it can 
be deferred
13/01: no progress
20/01: no progress
27/01: no progress
29/01: No progress.  The predefined formats do not need to be available 
when the spec is published.
Suman said that he had been mapping COBOL structures to DFDL and it didn't 
look as though the way text numbers are defined is very usable. He will 
document for next call 
066
Investigate format for defining test cases
25/11: IBM to see if it is possible to publish its test case format.
04/12: no update
09/12: no update
16/12: reminder sent to project manager
23/12: SH will send another reminder.
06/01: Another reminder will be sent
13/01: no update
20/01: no update
27/01: no progress
29/01: no progress
077
SKK:  mapping of COBOL numbers to textNumberFormats.





A few comments in-line below

On Wed, Jan 20, 2010 at 7:01 AM, Alan Powell <alan_powell at uk.ibm.com> 
wrote:

I have answered most of the issues and comments raised by Steve and Mike 
but some need further discussion. 


Issues from Steve H 

General. Although dfdl:encoding enums are case insensitive, we should 
stick to UC throughout in examples. 

2. I agree with the existing comment that the RFC2119 key words should be 
upper case. 

14.3.4. There are type/rep combinations where lengthKind="implicit" is not 
allowed - so saying that 'pattern' is replaced by 'implicit' on unparsing 
does not work. 
TBD 

We covered this on the most recent wg call.
 

16.2. I'm not sure that scannability in this constant encoding sense is 
necessary for patterns. I can create a regular expression that extracts 
all characters up to hex value xXX or all characters up to xYY, thereby 
treating the content as an encoding-insensitive black box. 

If your byte pattern happens to be a legal part of a multi-byte character 
sequence, then you'll get a false recognition, or you won't get what you 
expect.

Example: You are searching for byte 0xAA, but that can legally appear as 
byte 3 of a 3-byte UTF-8 encoded character. When you say you are looking 
for hex AA in a string, DFDL is currently defined to mean you are looking 
for the character represented by that raw byte. If the encoding is UTF-8, 
that isn't even a legal character encoding sequence, so the decoder should 
raise an error or something.
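
The false recognition described above can be reproduced in a few lines. 
This is a minimal Python sketch, not DFDL: the sample text and variable 
names are made up for illustration, and U+20AA is chosen only because its 
UTF-8 encoding (E2 82 AA) happens to end in byte 0xAA.

```python
# Illustrative only: the sample text and names are assumptions, not from the spec.
text = "price: \u20aa100"          # U+20AA encodes in UTF-8 as E2 82 AA
data = text.encode("utf-8")

# A raw byte scan finds 0xAA inside the three-byte character.
hit = data.find(b"\xaa")
print(hit, data[hit - 2:hit + 1].hex())

# Decoded on its own as UTF-8, the byte 0xAA is not a legal sequence.
try:
    b"\xaa".decode("utf-8")
    lone_byte_is_legal = True
except UnicodeDecodeError:
    lone_byte_is_legal = False
print(lone_byte_is_legal)
```

The scan reports a match in the middle of a character, which is the kind 
of result a byte-oriented pattern would hand back to the parser.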

Even for a fixed length single byte character set, you must have no 
unused codes that lack a mapping to ISO 10646, because our infoset is 
defined in terms of translations into that.

I think we need encoding="none" or encoding="bytes" or something if you 
really want to scan bytes without encoding causing problems.
 


Issues from Mike B 
·         Tracker issue: codepoints outside BMP, as literals and in data. 
·         If I put in a value that requires use of a high/low surrogate 
pair, is that an error, does it require me to put in two separate %#...; 
thingys, one for each of the surrogates (in which case these are not 
really code points in ISO10646). If I put in a codepoint for one of the 
supplemental characters and the schema itself is written in UTF-16 then 
that has to translate into a literal surrogate pair. Ok, but I'm very 
uncertain about all this stuff.
The above item had two issues glommed together. There really are two 
separate issues. The above is about these crazy codepoints that use 
surrogate pairs. That's a minor corner case given the amount of use those 
get.
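
For reference, the surrogate-pair corner case looks like this. A minimal 
Python sketch, assuming U+1F600 as an arbitrary code point above U+FFFF; 
nothing here is DFDL syntax.

```python
# Illustrative only: U+1F600 is an arbitrary code point above U+FFFF.
ch = "\U0001F600"
utf16 = ch.encode("utf-16-be")

# Split the UTF-16 bytes into 16-bit code units.
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
print([hex(u) for u in units])

# The same pair computed by hand from the code point.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)      # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)     # low (trail) surrogate
print(hex(high), hex(low))
```

One code point becomes two 16-bit code units, which is why a single 
numeric entity and a pair of surrogate entities are not interchangeable.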
The bigger issue is the one below, which is about things that are in 
strings but are broken character encodings, where we still need to be 
able to process the data. There's also the matter of recovery from errors 
in decoding, and what we put out when the infoset contains a character 
code where there is no valid encoding, or just a character code which 
isn't even in ISO 10646 (e.g., character code 0xFFFFFFFF, which is not a 
valid character at all).
Tracker Issue: illegal character encodings for parsing and unparsing. TBD: 
how do these make it into the infoset or are they replaced, and if so how? 
TBD: can one represent these in the infoset for output? Ideally not, but...
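
To make the options concrete, here is how one general-purpose decoder 
behaves. This is a Python sketch of Python's own codec behaviour, offered 
only as a point of comparison; the helper names are hypothetical and none 
of this is proposed DFDL behaviour.

```python
raw = b"ab\xffcd"   # 0xFF never appears in well-formed UTF-8

def strict_decode_fails(data):
    """True if strict UTF-8 decoding rejects the bytes."""
    try:
        data.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True

# Alternative policy: substitute U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

def is_valid_code_point(n):
    """True if Python accepts n as a code point (0 .. 0x10FFFF)."""
    try:
        chr(n)
        return True
    except (ValueError, OverflowError):
        return False

def can_encode(s, encoding="utf-8"):
    """True if every character of s has an encoding in the target charset."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(strict_decode_fails(raw), repr(replaced))
print(is_valid_code_point(0xFFFFFFFF), can_encode("\ud800"))
```

The parsing side has two visible policies (fail, or substitute a 
replacement character), and the unparsing side has values that exist in 
memory but cannot be written out, such as a lone surrogate.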
 
·         Tracker Issue: Processing-time Schema Definition Errors 
This section (2.3.1 in this draft) is problematic as we're trying to 
allow simple DFDL implementations to not do a bunch of static checking, 
yet if implementations differ on when Schema Definition errors are 
detected, then the second paragraph says they are converted to processing 
errors. This lets different implementations do very different things in 
terms of how the speculative parsing back-tracks around. 
Grammar ambiguity is a very tricky case. Unless a DFDL implementation can 
prove a grammar to be unambiguous, then it is very hard to say that any 
particular combination of delimiters makes up a legal DFDL schema 
definition. If the parser simply fails because the grammar was ambiguous, 
there's no way to tell the difference between this and just broken data 
without proving the grammar is unambiguous. In general it is formally 
undecidable whether a grammar is ambiguous or unambiguous. (
http://books.google.com/books?id=lIuu53IcKWoC&pg=PT217&lpg=PT217&dq=proving+a+grammar+is+unambiguous&source=bl&ots=wie8TAt-MT&sig=ZSD7tIwnXZIT8Ic91BWMH2H2dKg&hl=en&ei=hAQ5S5vPOIri7APc37CKBg&sa=X&oi=book_result&ct=result&resnum=10&ved=0CDAQ6AEwCQ#v=onepage&q=proving%20a%20grammar%20is%20unambiguous&f=false
) 
Since DFDL v1.0 doesn't allow recursive declarations/definitions, it may 
be possible to prove the ambiguity or unambiguity of a DFDL schema (or 
rather, the data syntax grammar described by it, if you want to bother to 
distinguish the two), but recursion isn't something we want to rule out 
for the future, so 
Type checking is decidable in DFDL's expression language, so we could 
always detect type safety before run time; however, if we allow a 
simplistic DFDL implementation to just check types at run time then this 
would, by the definition in this section (2.3.1), issue processing errors 
when it detects these at run time, thereby allowing backtracking of the 
speculative parser to be driven off of type-checks in the expression 
language.  It seems to me that we need to find a way to put this problem 
back into the hands of the user, and say that a schema where this actually 
matters (one where a type error causes a backtrack, which ultimately 
causes a successful parse) is illegal but implementations are allowed to 
not detect this particular illegality. 
It seems to me we need to put this problem back into the hands of the 
user. 
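
The scenario of a type error driving a back-track can be shown with a toy 
model. This is a Python sketch under stated assumptions: the class, the 
functions and the dictionary standing in for an infoset are all 
hypothetical, and the choice logic is far simpler than real speculative 
parsing.

```python
# Toy model only: every name here is hypothetical, not a DFDL API.
class ProcessingError(Exception):
    """A recoverable failure; the parser may back-track and try another branch."""

def evaluate_count_plus_one(infoset):
    # The schema author's expression adds 1 to a value that is really a string.
    # A static checker would reject this schema before any data is parsed.
    try:
        return infoset["count"] + 1
    except TypeError as exc:
        # Run-time-only checking: the type error surfaces as a processing error.
        raise ProcessingError(str(exc)) from exc

def branch_a(infoset):
    return {"branch": "a", "value": evaluate_count_plus_one(infoset)}

def branch_b(infoset):
    return {"branch": "b", "value": infoset["count"]}

def parse_choice(infoset, branches):
    for branch in branches:
        try:
            return branch(infoset)
        except ProcessingError:
            continue  # back-track to the next alternative
    raise ProcessingError("no branch matched")

result = parse_choice({"count": "3"}, [branch_a, branch_b])
print(result["branch"])
```

Here the parse succeeds through the second branch, so the schema bug in 
the first branch is never reported. An implementation that checks types 
statically would have rejected the same schema up front, which is the 
divergence the paragraph above is worried about.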
·         Tracker Issue: "round trip" for infoset. Should we omit the 
whole point? 
·         Tracker Issue: [schema] is an absolute or relative SCD. Why 
bother allowing absolute? 
·         Tracker Issue: Glossary as the place for centralized 
definitions, or should they be repeated there, but also introduced at 
point of first use, or should we put the definitions only at the places 
where they are discussed, and xref from the glossary? 
·         TBD: Issue - semantics of expressions containing relative paths 
that are inherited via ref to a dfdl:defineFormat. (also section 10.3)

·         TBD: Issue - XPath term - we are not consistent about using the 
term XPath, or "expression" when referring to our expression language. I 
prefer to call it our expression language, and then in the section that 
defines it state that it is a strict subset of XPath 2.0.

·         TBD: Issue - fn:position is unclear given that we've just said 
we don't support sequences in the expression language. 
·         TBD: Issue - order of sections. Scoping rules section should 
come before variables section, which uses these concepts. 
TBD: Issue: Case sensitivity of enum names - did we say whether this is 
case sensitive or not? I believe it should be case sensitive. 

·          Issue: dfdl:representation - Strings in binary rep. I see no 
reason why elements of type xs:string should examine dfdl:representation. 
They shouldn't care what it is, they are always "text". I should be able 
to specify a bunch of inter-mixed binary number and string elements 
without having to specify dfdl:representation="text" just to avoid an 
error on the string type elements. I believe xs:string type ignores 
dfdl:representation (always behaves as if dfdl:representation is 
'text'). (If we change this then the property precedence section for 
simpletypes changes slightly as representation="text" is implied if type 
is string.)
That will make it impossible to introduce a binary representation of text 
later.

What is "a binary representation of text"? Is there a real issue here? 
This is a primary convenience and clarity issue for me. I do not want to 
have to change to representation="text" for every string inside a COBOL 
structure, which is ultimately a binary representation object. To me 
type="string" is enough. I want to put in the file scope level of the 
schema a representation="binary", and then decorate the elements with the 
specifics of their types, but I do not expect to have to put 
representation="text" on anything.

I do not understand what you are trying to achieve by requiring 
representation="text" for things that are already textual based on the 
type. 

The rest of the issues below I think we need to discuss on calls.

textStringPadCharacter, textNumberPadCharacter - did we agree that this 
character must be a "minimum width" character if the char set encoding is 
variable width? (i.e., the pad char must be 1 byte if the encoding is 
UTF-8.) 
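
A check of this kind is easy to state. A minimal Python sketch, assuming a 
hand-written table of minimum character widths per encoding; the table and 
the function name are hypothetical, and the rule itself is still the open 
question above.

```python
# Hypothetical table: smallest encoded size, in bytes, of any character.
MIN_WIDTH = {"utf-8": 1, "utf-16-be": 2, "utf-16-le": 2}

def is_minimum_width(pad_char, encoding):
    """True if pad_char occupies the smallest unit the encoding allows."""
    return len(pad_char.encode(encoding)) == MIN_WIDTH[encoding]

print(is_minimum_width(" ", "utf-8"))        # ASCII space
print(is_minimum_width("\u3000", "utf-8"))   # ideographic space, 3 bytes in UTF-8
```

With a minimum-width pad character, any number of padding bytes is a whole 
number of pad characters, which keeps fixed-length padding well defined.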

numberInfinityRep numberNanRep - Is this applicable only to xs:double and 
xs:float? Also, what I've seen requires a distinction of sign. I.e., there 
are positive and negative infinities often printing as -inf and +inf. 
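
The sign distinction can be illustrated with a small lookup. A Python 
sketch only: the table, the representation strings and the function name 
are assumptions for illustration, not proposed property values.

```python
import math

# Hypothetical representation table; the strings are examples, not spec values.
SPECIAL_REPS = {"+inf": math.inf, "-inf": -math.inf, "NaN": math.nan}

def parse_float_text(text):
    """Map a special representation to its value, else parse as a number."""
    if text in SPECIAL_REPS:
        return SPECIAL_REPS[text]
    return float(text)

print(parse_float_text("-inf") < 0 < parse_float_text("+inf"))
print(math.isnan(parse_float_text("NaN")))
```

A single infinity representation could not round-trip both values, since 
the two infinities are distinct in IEEE 754 floating point.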
·         TBD: Issue - \n in regular expressions - clarify relationship of 
this to entities like NL entity. Also, if I include an entity like WSP* in 
a regular expression (can I?) does it then match accordingly? 
It appears that some of our multi-valued entities like WSP+ create 
conditional "matching" behavior without having to use regular expressions, 
e.g., when WSP+ is used as a separator. But can you use entities like WSP+ 
in a regular expression? It seems you should be able to use regular 
"single valued" entities in a regular expression, it's these multi-valued 
ones that have tricky semantics. 
Added Unicode values to \n, \t, \r.  Disallow DFDL entities in regular 
expressions. 
14.1 Alignment - TBD: Issue - zero-based thinking here. But all the bits 
stuff and everything else in DFDL uses 1-based reasoning. Need to revisit 
to make this sensible for 1 based world. 
Added implicit alignment table. TBD zero-based
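
The difference between the two conventions is a constant shift, as the 
following Python sketch shows. It is one possible reading of the issue, 
not the spec's definition: the function names are hypothetical, and the 
assumption is that position 1 is the first aligned position in the 
1-based view.

```python
def next_aligned_offset0(offset, alignment):
    """Zero-based: aligned offsets are 0, alignment, 2*alignment, ..."""
    return -(-offset // alignment) * alignment   # round up to a multiple

def next_aligned_position1(position, alignment):
    """One-based: aligned positions are 1, alignment + 1, 2*alignment + 1, ..."""
    return next_aligned_offset0(position - 1, alignment) + 1

for pos in (1, 2, 5, 6):
    print(pos, next_aligned_position1(pos, 4))
```

Under this reading, a 1-based rule such as "position is a multiple of the 
alignment" would be off by one relative to the zero-based rule, so the 
wording needs to say which convention it uses.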
finalTerminatorCanBeMissing - spec is not clear. Also is there a 
finalSeparatorCanBeMissing? 
Changed to finalDocumentTerminatorCanBeMissing and 
finalDocumentSeparatorCanBeMissing. Not sure where 
finalDocumentSeparatorCanBeMissing should be specified. Looks odd on 
'distinguished root'. These properties operate differently from other 
properties as they are defined on the 'distinguished root' but affect some 
lower down element. Effectively they are put in scope by a different 
mechanism



 
Regards

 
Alan Powell
 
Development - MQSeries, Message Broker, ESB
IBM Software Group, Application and Integration Middleware Software
-------------------------------------------------------------------------------------------------------------------------------------------
IBM
MP211, Hursley Park
Hursley, SO21 2JN
United Kingdom
Phone: +44-1962-815073
e-mail: alan_powell at uk.ibm.com






-------------- next part --------------
A non-text attachment was scrubbed...
Name: Resolving Uncertainty and Discriminators-	parent exists- v3.doc
Type: application/octet-stream
Size: 60928 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20100201/476fab85/attachment-0002.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Resolving Uncertainty and Discriminators-	component exists- v3.doc
Type: application/octet-stream
Size: 71168 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20100201/476fab85/attachment-0003.obj 

