[DFDL-WG] DFDL Revision 033 Comments

Wed Feb 25 07:15:14 CST 2009

Hi Dave

Thanks for your comments. 

However, one general observation about DFDL.  It is not the case that the 
end result of a DFDL parse is something that is always XML compliant. We 
have been very careful to decouple parsing from transformation.  If we 
said that DFDL created something XML compliant, the next thing is to want 
xs:attributes in the DFDL model so that the act of parsing can create XML 
that includes attributes. I've worked on a project that made this mistake 
and has paid the price ever since, with an overly complicated model that 
forces users to know more about XML than they need to.  DFDL is intended 
to model and parse non-XML data. It happens to use a subset of XML Schema 
as its model for convenience. But what it produces is not XML. If you want 
to take the output from DFDL and create XML, fine, but that's a post-parse 
transformation step.

More specifically:
- The DFDL infoset is not the XML infoset. The XML infoset is not rich 
enough to carry type information; everything is character items. It can't 
represent binary data. It can't ever handle hex 00 (null).
- The DFDL infoset is not the XDM. The XDM Text node assumes that the 
content can be expressed as characters, which is not true of binary data. 
The XDM Comment node is too restrictive.
- The DFDL infoset is not the XSD PSVI. That enforces validation of the 
data as it is parsed, which is not mandated by DFDL.  There were other 
reasons too - I'll let Sandy Gao explain further.
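
The hex 00 point can be checked with any conforming XML 1.0 parser. Below is a minimal sketch, assuming Python's standard xml.etree.ElementTree module as the parser; the helper name xml_accepts and the element name field are made up for illustration and are not part of DFDL.

```python
import xml.etree.ElementTree as ET


def xml_accepts(document: bytes) -> bool:
    """Return True if the bytes parse as well-formed XML, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# U+0000 is outside the XML 1.0 character range, so neither a character
# reference nor a literal NUL byte is accepted by the parser.
results = {
    "reference": xml_accepts(b"<field>&#0;</field>"),
    "literal": xml_accepts(b"<field>\x00</field>"),
    "plain": xml_accepts(b"<field>abc</field>"),
}
print(results)
```

Binary fields that contain a zero byte therefore have no direct character representation in an XML document; they would need an encoding such as hex or base64, which is a transformation step rather than parsing.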

However, I can see value in changing the DFDL infoset to be more 
compatible with XDM, where that makes sense. We can discuss further on the 
call. 

Regards

Steve Hanson
Programming Model Architect
WebSphere Message Brokers
Hursley, UK
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848

"Dave Glick" <dglick at dracorp.com> 
Sent by: dfdl-wg-bounces at ogf.org
20/02/2009 21:40

To
<dfdl-wg at ogf.org>
cc

Subject
[DFDL-WG] DFDL Revision 033 Comments

All,

I have completed my review of draft 33 of the standard. I've read through 
the document as a whole several times and spent a considerable amount 
of time digesting each of the sections. I've included nearly all my 
comments in the document (attached) but because there are so many, and 
reconciling them with ongoing revisions of the document may be difficult, 
I've included what I feel are the important points in this email (I 
realize there are a lot of points here, but I guess that's what happens 
when someone takes a totally fresh look at things). I would like to note 
that these are all just suggestions. I commented wherever I had a 
question or concern for completeness' sake, but certainly don't expect all 
of my feedback to be incorporated, especially since many of the concerns 
may have already been discussed and addressed in previous iterations of 
the standard. My only goal and motivation is to help make the best 
standard possible, and hopefully some of these suggestions will be food 
for thought.

General

- The document feels overly verbose and explanatory to me. There are many 
whole sections and blocks of text that, while very valuable, don't really 
seem appropriate in a normative standards document. The document should 
explain "what is," not necessarily "why it is." I understand that there 
was a previous discussion about whether portions of the document should be 
extracted and instead included in a separate non-normative "DFDL Primer" 
similar to the way W3C structured the XML Schema standard. My reaction is 
that doing so would help clean up the document. Using technical books as a 
metaphor, my own feeling is that a normative standard should be more like 
"The Definitive UNIX Reference" and less like "Introduction to UNIX" or 
even "Expert-Level UNIX". I think the standard falls a little too far into 
the latter category right now.

- Related to the previous comment, the section that seemed the most out of 
place to me was the discussion on the parsing and unparsing processes and 
their relationship to grammar and general parsing concepts ("DFDL 
Properties Introduction"). Though the discussion was extremely valuable 
from the standpoint of a potential implementer and may actually be the 
only way to implement the standard, I think it may fall too far into the 
"how to implement" category and might be more appropriate in an appendix 
(marked as non-normative) or a primer if one is created.

- There are certain sections that seemed a little misplaced to me. In 
every case, it became clear over time why the document was organized the 
way it was, but some revision may make it easier to digest. Generally, it 
seemed that some concepts were "spread out" and not organized under 
encompassing umbrella sections. Though the way it's currently structured 
may make the document more componentized, it makes it harder to understand 
from a "where do I look for all the information on X" standpoint. The most 
obvious example is all the sections dealing with representation properties 
such as the list of representation property precedence and the sections on 
sequences, choices, etc. My thought is that since all those sections are 
really discussing different representation properties and aspects thereof 
that it seems reasonable to group them into one overarching section. There 
are other areas where I thought the organization could be improved 
including the discussion on element vs. attribute vs. short binding forms 
(seemed misplaced given that several other non-representation property 
annotation element attributes such as setVariableName can use the 
alternate forms, and also broke up the flow of annotation element 
descriptions) and the glossary (I like the idea of defining specific terms 
used broadly throughout the document to remove ambiguity, but a general 
purpose glossary feels more appropriate as an appendix). To codify my 
thoughts, I worked up a TOC that I think exhibits a more understandable 
organizational structure. I've attached it not because I want or expect 
the entire structure of the document to be modified, but as a "jumping 
off" point for discussion.

- The standard references RFC 2119 for defining certain terms such as 
MUST, SHALL, etc. In most other standards I've seen, emphasis is placed on 
these terms when their meaning is to be taken from RFC 2119, and I would 
suggest that DFDL do the same. It also appears as though the terms aren't 
being used throughout the document as regularly as they could or should 
be. I would suggest that at some point in the final revision process we 
scrub the document for requirements concepts and make sure to use the 
appropriate RFC 2119 terms where possible. This should remove ambiguity 
about what's expected from implementations.
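
A scrub like this could be partly mechanized. The following is only a rough sketch, assuming the draft has been exported to plain text; the function name find_unemphasized_terms and the sample sentences are invented for illustration. It flags RFC 2119 key words that are not written in capitals so a human can decide whether each one is meant normatively.

```python
import re

# The ten key words defined by RFC 2119.
RFC2119_TERMS = [
    "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", "OPTIONAL",
]

# Longest terms first so that "MUST NOT" is matched before "MUST".
_PATTERN = re.compile(
    r"\b("
    + "|".join(
        term.replace(" ", r"\s+")
        for term in sorted(RFC2119_TERMS, key=len, reverse=True)
    )
    + r")\b",
    re.IGNORECASE,
)


def find_unemphasized_terms(text):
    """Return (line_number, matched_text) for key words not in capitals."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            word = match.group(0)
            if word != word.upper():
                hits.append((lineno, word))
    return hits


sample = (
    "An implementation must reject the schema.\n"
    "A processor MUST NOT continue.\n"
    "It may warn."
)
hits = find_unemphasized_terms(sample)
print(hits)
```

Words such as "may" are common in ordinary prose, so the output is a review list rather than a list of errors.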

DFDL Information Set

- I'm sure this has already been discussed at length, but I wonder if it 
would be possible to define the DFDL Information Set as an extension to 
the XPath Data Model (XDM) as XSLT does for its data model. This would 
have many advantages. The XDM is compatible with both the XML Schema PSVI 
and the XML Information Set (and the XDM standard explicitly explains the 
conversion process to and from each). This therefore provides 
interoperability with the alternate representations and uses of a DFDL 
Schema as an XML Schema and as plain XML content. Additionally, an XDM (or 
some reasonable facsimile) will have to be constructed from the DFDL 
Schema anyway to support the XPath capabilities of DFDL; basing the DFDL 
Infoset on XDM to begin with would ensure seamless (or at least easier) 
use with XPath libraries and infrastructure. Also, XDM and DFDL both use 
the XML Schema type system (after all, DFDL is a subset of XML Schema) and 
as such XDM already supports the DFDL types.

- If the above is infeasible or too big of a change for this late in the 
process, would it at least be possible to define the DFDL Information Set 
in terms of the XML Information Set standard? The DFDL Information Set 
already appears to be loosely based on it, and may actually be compatible 
(I don't know) but the relationship is not explicit. Without such a 
statement and the satisfaction of the requirements of extending the XML 
Information Set as defined in that standard, implementations can't rely on 
the compatibility. If the relationship was made explicit and we ensured 
that the DFDL Information Set was indeed compatible with and extended the 
XML Information Set, then the XDM needed to process DFDL expressions could 
be generated using the Infoset to XDM process described in the XDM 
standard. If we went this route, I would also make sure that we maintain 
compatibility with the PSVI; that is, we don't want to introduce concepts 
or information set members that conflict with the PSVI. This will make it 
easier on implementers because they could potentially reuse the same 
internal infoset representation for both the DFDL Information Set and the 
PSVI during validation processes.

- I imagine it will take some investigation to determine if either of 
these options is possible and compatible with DFDL concepts. I don't mind 
taking on the task if there is interest in modifying the DFDL infoset. It 
just seems a shame to me to forgo an opportunity to establish some synergy 
with related XML standards.

- In any case, the concept of simple element information items and complex 
element information items seems contrary to established convention. The 
concept of using character information items (and groupings of them as 
explicitly allowed in the XML Information Set standard) to represent child 
simple content has already been established through the XML Information 
Set standard and other related XML standards. It is especially confusing 
given that the same terms as the XML Information Set are used.

- Should everything in the DFDL Information Set have a corresponding 
representation in an XML document generated from or used to generate it 
(not the DFDL Schema, but the result or input to parsing or unparsing)? 
This question occurred to me based on the discussion in the most recent 
teleconference about treating and representing comments as separate kinds 
of content. It was suggested that the infoset would need to handle 
comments in a special way so as to differentiate them from non-commented 
content. My concern is that there may not be an appropriate XML 
representation of such an infoset item. Creating a special element in the 
result document would break the property that the result of DFDL 
processing can be validated by the DFDL Schema (because the commented 
element wouldn't have been declared in the original DFDL Schema; it 
couldn't be a declared element because comments can appear anywhere in the 
source content). The only other option I can see would be to treat source 
content identified as comments and indicated as such in the infoset as XML 
comments in the result document. This brings up the interesting 
complication during unparsing of differentiating between "real" XML 
comments (those that should truly be ignored) and "output" XML comments 
(those that should be output to the result stream as commented content). 
The solution might be to treat all XML comments in a document used for 
unparsing as available for output as commented content, but it seems 
unreasonable to redefine XML comments in that way. This brings us back to 
the original question: if there is no way to adequately describe commented 
content in a resultant XML document, does everything in the infoset need a 
representation in an XML result document? What are the implications to 
upholding the ability to round-trip (if the resultant XML document doesn't 
contain everything in the infoset, and everything in the infoset is needed 
to fully describe the source content, then unparsing the resultant XML 
document will not result in the original source content)?

Annotation Elements and Representation Properties

- There seem to be inconsistencies throughout the document, specifically 
in the descriptions of annotation elements and representation properties. 
This is to be expected in a document that's been under heavy revision over 
such a long time span, but an effort will need to be made to scrub out all 
inconsistencies before the final version. To this end, I've found creating 
a table of all annotation elements and their properties helpful. I've 
attached what I have so far. It has all annotation elements and their 
attributes and a notional start for the representation properties. I 
intend to complete it as I go and hopefully make sure everything matches 
up in the process.

- I'm not sure I understand the value in having the specialized annotation 
elements. From the DFDL user/developer perspective it seems more difficult 
because they need to recognize additional syntax. For example, when they 
see a dfdl:choice annotation element they need to understand that it's 
really a dfdl:format with a subset of allowed representation properties 
appropriate to xs:choice elements. They still must refer to the standard 
document to find out which representation properties are allowed, and the 
alternate syntax doesn't necessarily help in validation because a standard 
dfdl:format annotation element would also have been valid (and the DFDL 
XML Schema can't determine which representation properties are valid on a 
dfdl:format based only on usage location). It also makes the document more 
confusing because representation properties are referred to as being valid 
for specific dfdl:* annotation elements as opposed to the real meaning, 
which is that they're valid for dfdl:format elements that annotate 
specific XML elements. To put it another way, a representation property 
that is valid for dfdl:choice is also only valid for dfdl:format when used 
as an annotation of xs:choice or as a short form property on xs:choice 
elements, but this isn't necessarily clear from the property descriptions 
since they only refer to the dfdl:* special annotation elements. From an 
implementation perspective, it adds complexity because the extra element 
names must be accepted. The DFDL parser will still have to validate 
representation properties and their validity as applied to the parent 
schema element regardless of whether the annotation element is a 
dfdl:format or a special annotation element. Not to mention, wouldn't 
short form be used most frequently anyway, in which case there are no 
annotation elements? In any case, I see very little value for a 
disproportionate amount of added complexity and potential confusion and I 
suggest the concept of special annotation properties that restrict 
dfdl:format be removed.

- The standard isn't totally clear and unambiguous on the behavior with 
respect to the dfdl:format selector property. It is mentioned that the 
selector is externally identified, but no additional information is given. 
Are the selectors implementation specific? If so, does that break 
compatibility with alternate DFDL parsers if the selector property is 
used? What if a selector is referred to but doesn't exist in the parser? 
Is it a schema definition error or a parsing error (when are external 
selectors resolved)? What if there is no "default" dfdl:format block and 
they all contain non-matching selectors? Is that a processing error 
(should be explicit)?

- When a defined format is put into use, how/when are the representation 
properties checked for validity with respect to the schema element (such 
as xs:choice) that put the defined format into use? To put it another way, 
is it an error (and what kind) if a defined format specifies 
representation properties that aren't valid for the schema element that 
uses it? Can a defined format contain the special format annotation 
elements (dfdl:sequence, dfdl:choice, etc.)?

- The standard says that a dfdl:defineFormat can contain any of the other 
annotation elements. How are the other annotation elements contained 
within a dfdl:defineFormat (such as dfdl:assert or dfdl:hidden) applied 
when a named format definition is referenced by a dfdl:format ref 
attribute? Can named format definitions be referenced anywhere else other 
than where a dfdl:format is expected? Do all other annotation elements 
make sense, and are they valid, wherever a defined format would be 
referenced? If not, I suggest explicitly stating what annotation elements 
are allowed within a dfdl:defineFormat as opposed to saying any are allowed.

- The descriptions for dfdl:assert and dfdl:discriminator read very 
similarly (and probably for good reason) but it's not clear how they're 
different. If the failure of a dfdl:discriminator results in a processing 
error, doesn't that make it equivalent to an assert? In other words, how 
can it be used for control when one of the two possible outcomes results 
in an error that (potentially) halts processing? May want to refine the 
description of dfdl:discriminator.

- What about the positioning of hidden elements relative to siblings? Can 
they appear anywhere within the parent - in which case, is relative 
position important? May want to address this one way or the other.

- The properties for dfdl:textNumberFormat are defined in the 
representation property section. Granted, they may be representation 
properties from the conceptual level, but syntactically, they would appear 
to be different. Can the text number format properties be used in a 
dfdl:format or dfdl:property element? If not, then suggest treating them 
more as attributes of the dfdl:textNumberFormat element and defining them 
there. If so, then I wonder what the purpose of the dfdl:textNumberFormat 
element is; it would seem to fall into the same category as the other 
special dfdl:format annotation elements that restrict the set of valid 
representation properties. Also seems to apply to dfdl:defineEscapeScheme 
and its representation properties.

- The document isn't clear on how the position of a variable declaration 
impacts its scope. Does it apply to all children of the element to which 
the definition belongs (regardless of position relative to the definition), 
to all siblings following the definition (but not preceding), or to all 
elements following the definition (regardless of hierarchy)? I assume the 
first, but more clarification would be helpful in order to make it 
unambiguous.

- The value type for representation properties is listed in conceptual 
terms, but shouldn't all properties actually accept one or more specific 
XML Schema (within the DFDL subset) types (usually atomic)? Making this 
explicit would remove confusion on the part of implementers. For example, 
several are defined as "Enum". Though the value may logically be an 
enumerated type, the actual atomic type is something else like xs:string 
or xs:token, with additional validation to ensure it's one of the allowed 
enumerated values. The normative standard is first and foremost a 
reference for implementations and as such should be totally unambiguous 
with regard to typing information.
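
To make the "Enum" point concrete, here is a rough sketch of what an implementation ends up doing when the underlying type is xs:token: collapse whitespace, then check membership in the allowed set. The property name byteOrder and its two values are used purely as illustrative placeholders, and the function names are invented; none of this is taken from the draft.

```python
import re

# Illustrative placeholder table: property name -> allowed enumerated values.
ALLOWED_VALUES = {
    "byteOrder": {"bigEndian", "littleEndian"},
}


def collapse_token(raw: str) -> str:
    """Approximate xs:token whitespace handling: squeeze runs of
    space, tab, LF and CR into one space and trim both ends."""
    return re.sub(r"[ \t\n\r]+", " ", raw).strip(" ")


def validate_enum_property(name: str, raw: str) -> str:
    """Return the normalized value, or raise ValueError if it is not allowed."""
    value = collapse_token(raw)
    if value not in ALLOWED_VALUES[name]:
        raise ValueError(f"{value!r} is not a valid value for {name}")
    return value


normalized = validate_enum_property("byteOrder", "  bigEndian\n")
print(normalized)
```

If the standard named the atomic type and the enumeration explicitly, both steps would be unambiguous across implementations.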

- There appear to be cases (such as alignment) where multiple types can be 
accepted unnecessarily. In the alignment case, a specific xs:string or a 
positive integer type is valid. Wouldn't this be easier on both the DFDL 
XML Schema and the implementations if, wherever possible, only one atomic 
type was accepted? In the alignment case, an xs:nonNegativeInteger could be 
used where "0" means "implicit".
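
As a sketch of the difference for an implementer, compare parsing the two-type form with the suggested single-type form. The function names are hypothetical, and the digit check is a simplification of the xs:nonNegativeInteger lexical space (it ignores an optional leading sign).

```python
import re
from typing import Union

IMPLICIT = "implicit"


def parse_alignment_two_types(raw: str) -> Union[str, int]:
    """Two-type form: the keyword 'implicit' or a positive integer."""
    if raw == IMPLICIT:
        return IMPLICIT
    if re.fullmatch(r"[0-9]+", raw) and int(raw) > 0:
        return int(raw)
    raise ValueError(f"invalid alignment: {raw!r}")


def parse_alignment_single_type(raw: str) -> int:
    """Suggested form: one non-negative integer, where 0 stands for implicit."""
    if re.fullmatch(r"[0-9]+", raw):
        return int(raw)
    raise ValueError(f"invalid alignment: {raw!r}")


print(parse_alignment_two_types("implicit"), parse_alignment_two_types("4"))
print(parse_alignment_single_type("0"), parse_alignment_single_type("4"))
```

The single-type version returns one type and needs one branch, at the cost of a magic value that the standard would have to document.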

- With regard to case sensitivity, how is the case equivalency defined for 
different character sets? Is this (or should it be) related to XPath 
collations? Perhaps instead of an ignore case switch the user should be 
allowed to specify a collation for initiator/terminator comparison and the 
DFDL standard would require implementations include a case-insensitive 
collation for common character sets. This would open the door to using 
more general character/string comparison operations and could be important 
in certain settings. For example, the XPath standard has an example that 
"v" and "w" are equivalent in Swedish. This may have some other advantages: 
if collations are needed for this kind of thing, then we could probably 
support fn:compare and fn:codepoint-equal in the DFDL XPath subset.
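
A small illustration of why "ignore case" needs a precise definition, using Python's built-in string methods as stand-ins for two possible rules (simple lowercasing versus Unicode case folding). The helper names are invented, and neither rule is claimed to be what the draft intends.

```python
def equal_by_lowercase(a: str, b: str) -> bool:
    """Case-insensitive comparison using simple lowercasing."""
    return a.lower() == b.lower()


def equal_by_casefold(a: str, b: str) -> bool:
    """Case-insensitive comparison using Unicode case folding."""
    return a.casefold() == b.casefold()


# German sharp s: lowercasing keeps it as one character, while case
# folding expands it to "ss", so the two rules disagree on this pair.
delimiter_in_schema = "Straße"
delimiter_in_data = "STRASSE"

lower_result = equal_by_lowercase(delimiter_in_schema, delimiter_in_data)
fold_result = equal_by_casefold(delimiter_in_schema, delimiter_in_data)
print(lower_result, fold_result)
```

Two implementations that each pick a different rule would disagree on whether this initiator or terminator matches, which is the kind of ambiguity a named collation would remove.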

If you've made it this far, congrats :) Hopefully this list will spur some 
discussion.

Thanks,

Dave

---
David Glick  |  dglick at dracorp.com  |  703.299.0700 x212
Data Research and Analysis Corp.  |  www.dracorp.com


-------------- next part --------------
A non-text attachment was scrubbed...
Name: DFDL_TOC.doc
Type: application/octet-stream
Size: 47616 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20090225/7f36f91d/attachment-0002.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dfdlref.xls
Type: application/vnd.ms-excel
Size: 26624 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20090225/7f36f91d/attachment-0001.xls 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ogf-dfdl-v1.0-Core-033.final-dgcommented.doc
Type: application/octet-stream
Size: 2151424 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20090225/7f36f91d/attachment-0003.obj