[DFDL-WG] Notes from 2007-09-12 call

Wed Sep 12 15:20:46 CDT 2007

Mike Beckerle, Alan Powell, Steve Hanson, Suman Kalia attended.

We discussed these questions from Alan about the expression language.

        1. Accessing hidden values - it seems inconsistent to allow access 
to hidden values when XPath is used within the DFDL domain but not when 
it is used outside. 

        2. Where XPath is allowed in the schema - it is currently allowed 
in an arbitrary set of properties (initiator, terminator, separator, 
occurseparator, null, etc.). Why not allow it everywhere? 

W.r.t. (1), we decided the current behavior is correct. Path expressions 
for DFDL properties can see hidden elements; path expressions in other 
places (e.g., Schematron assertions) cannot.

W.r.t. (2), we decided that expressions should be allowed in principle 
everywhere for the value of any property; however, there may be exceptions 
for certain properties. Particularly, it seems some enum-valued properties 
are unlikely to ever want to be expressions. Example: dfdl:representation. 

However, it was also pointed out that once we put selectors back into the 
language you can interleave multiple formats in the same schema, and for 
any enumerated property you could just have one selector-chosen format for 
each possible value of the enumerated property. 

The reason we don't want a blanket statement that you can have expressions 
anywhere you need a property value is that there is some potential that 
this makes implementations unnecessarily complex due to the excess 
flexibility. 

Digression: (This added by MikeB - was not part of the call today.)
Consider:
       dfdl:byteOrder="{ if (../../x = 'B') then 'bigEndian' else if 
(../../x = 'L') then 'littleEndian' else 'I don''t know' }"

DFDL implementations must be prepared to cope with receiving "I don't 
know" as the proposed value for the byteOrder. This is a schema definition 
error, but it is happening at run time so becomes a processing error.  The 
only way to rule this out is to treat enumerated property values not as 
strings but as an enum type and force the expressions that compute them to 
return an enum type, not a string.
This is a kind of type inference I had hoped implementations would not 
need.
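
The run-time check described in the digression above can be sketched as 
follows. This is a minimal illustration in Python, not DFDL. The names 
BYTE_ORDER_VALUES, ProcessingError, byte_order_expression and 
resolve_byte_order are hypothetical and are not taken from any DFDL 
draft; the set of allowed values is limited to the two named in the 
example expression.

```python
# Hypothetical illustration; none of these names come from a DFDL draft.
BYTE_ORDER_VALUES = {"bigEndian", "littleEndian"}


class ProcessingError(Exception):
    """An error that can only be detected while data is being processed."""


def byte_order_expression(x):
    # Mirrors the example expression: the result is just a string,
    # so nothing stops it from being a value outside the enumeration.
    if x == "B":
        return "bigEndian"
    if x == "L":
        return "littleEndian"
    return "I don't know"


def resolve_byte_order(x):
    # The property value is only known once the expression has been
    # evaluated against the data, so the check has to happen here.
    value = byte_order_expression(x)
    if value not in BYTE_ORDER_VALUES:
        raise ProcessingError(f"invalid byteOrder value: {value!r}")
    return value
```

Without an enum-typed expression result, a check of this kind has to run 
every time the expression is evaluated.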

Selectors have the advantage of being statically verifiable; i.e., each 
selected format is known to use a valid value of the enum, or else the 
DFDL processor can issue a diagnostic. If we allow an arbitrary 
expression to return the value of an enumerated property, then it 
presumably could also return a nonsense value, as in the digression above.
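
To illustrate the static check that selectors make possible, here is a 
rough Python sketch. It assumes a made-up model in which each 
selector-chosen format is a dictionary of literal property values; the 
names ALLOWED, check_formats and the example formats are hypothetical 
and do not reflect any actual DFDL syntax.

```python
# Hypothetical model: enumerated properties and their permitted values.
ALLOWED = {"byteOrder": {"bigEndian", "littleEndian"}}


def check_formats(formats):
    """Return one diagnostic per enum property holding an invalid literal."""
    diagnostics = []
    for name, props in formats.items():
        for prop, value in props.items():
            allowed = ALLOWED.get(prop)
            if allowed is not None and value not in allowed:
                diagnostics.append(
                    f"{name}: {prop}={value!r} is not one of {sorted(allowed)}"
                )
    return diagnostics


# One format per selector outcome; every value is a literal, so all of
# them can be inspected before any data is read.
formats = {
    "B": {"byteOrder": "bigEndian"},
    "L": {"byteOrder": "littleEndian"},
    "other": {"byteOrder": "I don't know"},
}
```

Calling check_formats(formats) reports the "other" format without 
touching any data, which is the property an arbitrary expression lacks.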

We discussed proposals circulated by MikeB:

Here's an update to the first one. We decided sequences shouldn't be 
another way to carry opaque data. An easy and conservative way to fix 
this is to require the length of an empty sequence to be zero.

Second proposal to eliminate hexBinary and base64Binary was discussed 
lightly. It was suggested that one could have both, and that would make it 
easy to explain what the hexBinary type is, because it is a shorthand for 
a string with encoding="hex", and similarly for base64Binary. We did not 
resolve this issue on the call.

Finally, we discussed regular expression features for DFDL. 

There does appear to be a need for regexp features to support parsing 
data which is delimited by changing data content. E.g., consider 
"12345Mike Beckerle" and a two-element sequence. One element is a number 
which continues until the first non-digit character. The other is a 
string which begins with a non-digit character. Regexp length appears to 
be a good way to handle this kind of thing.
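
As a rough illustration of length determined by a regular expression, the 
example above can be worked through with Python's re module. This is only 
a sketch: the patterns and the name parse_number_then_text are 
assumptions made for illustration, not proposed DFDL syntax.

```python
import re

# Assumed patterns: the number is a run of digits; the string starts at
# the first non-digit character and runs to the end of the data.
NUMBER = re.compile(r"[0-9]+")
TEXT = re.compile(r"[^0-9].*", re.DOTALL)


def parse_number_then_text(data):
    """Split data into (number, string), using match length as field length."""
    number = NUMBER.match(data)
    if number is None:
        raise ValueError("expected leading digits")
    number_len = number.end()  # length of the first element
    text = TEXT.match(data, number_len)
    if text is None:
        raise ValueError("expected a non-digit character after the number")
    return data[:number_len], text.group(0)
```

With the data from the example, parse_number_then_text("12345Mike 
Beckerle") yields "12345" for the number and "Mike Beckerle" for the 
string; no delimiter or stored length is needed.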

Alan Powell has the action item to talk with the IBM internal TX product 
group. They have a speculative parser and so have fewer regular-expression 
features in their language. We want to understand how they deal with the 
header, body[], trailer use case. This case is where the data is lines of 
text, the header is the first line, the trailer is the last line, the body 
records are everything in between and there's no content that can be used 
to distinguish the record types. This is handled in some 
format-description systems with regexp features. In TX this is handled by 
speculative parsing and we want to understand how this comes out and if it 
is preferable to adding regexp features.
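
To make the use case concrete, here is a small Python sketch of the 
desired result, in which the role of each line is decided purely by its 
position. It is neither the regexp approach nor a description of how TX's 
speculative parser works (that is what the action item is meant to find 
out); the name split_records is hypothetical.

```python
def split_records(text):
    """Return (header, body_lines, trailer) for line-oriented text.

    No line content is inspected: the first line is the header, the last
    line is the trailer, and everything in between is body.
    """
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("need at least a header line and a trailer line")
    return lines[0], lines[1:-1], lines[-1]
```

Note that the trailer cannot be identified until the end of the data is 
reached, which is why a purely forward, content-driven parse is not enough 
here.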

Mike Beckerle
STSM, Architect, Scalable Computing
IBM Software Group
Information Platform and Solutions
Westborough, MA 01581
direct: voice and FAX 508-599-7148
assistant: Pam Riordan 
                  priordan at us.ibm.com 
                  508-599-7046

-------------- next part --------------
A non-text attachment was scrubbed...
Name: proposal-to-simplify-opaque-types-v4.doc
Type: application/octet-stream
Size: 49664 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20070912/c5971ee6/attachment-0001.obj