[DFDL-WG] DFDL Revision 033 Comments
Steve Hanson
smh at uk.ibm.com
Wed Feb 25 07:15:14 CST 2009
Hi Dave
Thanks for your comments.
However, one general observation about DFDL: the end result of a DFDL
parse is not always XML compliant. We have been very careful to decouple
parsing from transformation. If we said that DFDL created something XML
compliant, the next request would be for xs:attributes in the DFDL model,
so that the act of parsing can create XML that includes attributes. I've
worked on a project that made this mistake
and has paid the price ever since, with an overly complicated model that
forces users to know more about XML than they need to. DFDL is intended
to model and parse non-XML data. It happens to use a subset of XML Schema
as its model for convenience. But what it produces is not XML. If you want
to take the output from DFDL and create XML, fine, but that's a post-parse
transformation step.
More specifically:
- The DFDL infoset is not the XML infoset. The XML infoset is not rich
enough to carry type information; everything is character items. It can't
represent binary data, and it can never carry hex 00 (null).
- The DFDL infoset is not the XDM. The XDM Text node assumes that the
content can be expressed as characters, which is not true of binary data.
The XDM Comment node is too restrictive.
- The DFDL infoset is not the XSD PSVI. That enforces validation of the
data as it is parsed, which is not mandated by DFDL. There were other
reasons too - I'll let Sandy Gao explain further.
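The point about hex 00 can be checked with any conforming XML parser. The sketch below is only an illustration using Python's standard library; the helper name xml_accepts and the sample bytes are made up for the example and are not part of DFDL. It shows that a NUL character cannot appear in XML content, either literally or as a character reference, so binary data has to be re-encoded (here as hex text) in a post-parse step.

```python
import xml.etree.ElementTree as ET


def xml_accepts(text: str) -> bool:
    """Return True if `text` can appear as element content in well-formed XML."""
    try:
        ET.fromstring(f"<v>{text}</v>")
        return True
    except (ET.ParseError, ValueError):
        # ParseError is the normal failure; ValueError is caught defensively.
        return False


# Hypothetical binary field that includes hex 00.
raw = bytes([0x00, 0x01, 0xFF])

print(xml_accepts("hello"))   # ordinary characters are accepted
print(xml_accepts("\x00"))    # a literal NUL is not well-formed XML
print(xml_accepts("&#0;"))    # neither is a character reference to it

# A transformation step can carry the bytes as hex text instead.
print(xml_accepts(raw.hex().upper()))
```

The hex re-encoding is one possible transformation choice, not something the DFDL infoset itself requires.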
However, I can see value in changing the DFDL infoset to be more
compatible with XDM, where that makes sense. We can discuss further on the
call.
Regards
Steve Hanson
Programming Model Architect
WebSphere Message Brokers
Hursley, UK
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848
"Dave Glick" <dglick at dracorp.com>
Sent by: dfdl-wg-bounces at ogf.org
20/02/2009 21:40
To: <dfdl-wg at ogf.org>
cc:
Subject: [DFDL-WG] DFDL Revision 033 Comments
All,
I have completed my review of draft 33 of the standard. I've read through
the document as a whole several times and spent a considerable amount
of time digesting each of the sections. I've included nearly all my
comments in the document (attached) but because there are so many, and
reconciling them with ongoing revisions of the document may be difficult,
I've included what I feel are the important points in this email (I
realize there are a lot of points here, but I guess that's what happens
when someone takes a totally fresh look at things). I would like to note
that these are all just suggestions - I commented wherever I had a
question or concern for completeness' sake, but certainly don't expect all
of my feedback to be incorporated - especially since many of the concerns
may have already been discussed and addressed in previous iterations of
the standard. My only goal and motivation is to help make the best
standard possible, and hopefully some of these suggestions will be food
for thought.
General
- The document feels overly verbose and explanatory to me. There are many
whole sections and blocks of text that, while very valuable, don't really
seem appropriate in a normative standards document. The document should
explain "what is," not necessarily "why it is." I understand that there
was previous discussion about whether portions of the document should be
extracted and instead included in a separate non-normative "DFDL Primer"
similar to the way W3C structured the XML Schema standard. My reaction is
that doing so would help clean up the document. Using technical books as a
metaphor, my own feeling is that a normative standard should be more like
"The Definitive UNIX Reference" and less like "Introduction to UNIX" or
even "Expert-Level UNIX". I think the standard falls a little too far into
the latter category right now.
- Related to the previous comment, the section that seemed the most out of
place to me was the discussion on the parsing and unparsing processes and
their relationship to grammar and general parsing concepts ("DFDL
Properties Introduction"). Though the discussion was extremely valuable
from the standpoint of a potential implementer and may actually be the
only way to implement the standard, I think it may fall too far into the
"how to implement" category and might be more appropriate in an appendix
(marked as non-normative) or a primer if one is created.
- There are certain sections that seemed a little misplaced to me. In
every case, it became clear over time why the document was organized the
way it was, but some revision may make it easier to digest. Generally, it
seemed that some concepts were "spread out" and not organized under
encompassing umbrella sections. Though the way it's currently structured
may make the document more componentized, it makes it harder to understand
from a "where do I look for all the information on X" standpoint. The most
obvious example is all the sections dealing with representation properties
such as the list of representation property precedence and the sections on
sequences, choices, etc. My thought is that, since all those sections are
really discussing different representation properties and aspects thereof,
it seems reasonable to group them into one overarching section. There
are other areas where I thought the organization could be improved
including the discussion on element vs. attribute vs. short binding forms
(seemed misplaced given that several other non-representation property
annotation element attributes such as setVariableName can use the
alternate forms, and also broke up the flow of annotation element
descriptions) and the glossary (I like the idea of defining specific terms
used broadly throughout the document to remove ambiguity, but a general
purpose glossary feels more appropriate as an appendix). To codify my
thoughts, I worked up a TOC that I think exhibits a more understandable
organizational structure. I've attached it not because I want or expect
the entire structure of the document to be modified, but as a "jumping
off" point for discussion.
- The standard references RFC 2119 for defining certain terms such as
MUST, SHALL, etc. In most other standards I've seen, emphasis is placed on
these terms when their meaning is to be taken from RFC 2119 - I would
suggest that DFDL do the same. It also appears as though the terms aren't
being used throughout the document as regularly as they could or should
be. I would suggest that at some point in the final revision process we
scrub the document for requirements concepts and make sure to use the
appropriate RFC 2119 terms where possible. This should remove ambiguity
about what's expected from implementations.
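As a rough illustration of the RFC 2119 scrub suggested above, a short script could list every place where a keyword appears without uppercase emphasis, leaving it to a human to decide whether a requirement was intended. This is only a sketch: the function name and the sample text are invented for the example, and the keyword list is the one given in RFC 2119.

```python
import re

# RFC 2119 keywords, multi-word forms first so they match before "MUST"/"SHALL".
KEYWORDS = [
    "MUST NOT", "SHALL NOT", "SHOULD NOT",
    "MUST", "REQUIRED", "SHALL", "SHOULD", "RECOMMENDED", "MAY", "OPTIONAL",
]
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def unemphasized_keywords(text):
    """Return (line number, matched text) for keywords not written in uppercase."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            word = match.group(0)
            if word != word.upper():
                hits.append((lineno, word))
    return hits


# Hypothetical sample text, not taken from the DFDL draft.
sample = (
    "A processor must report the error.\n"
    "A processor MUST NOT continue.\n"
    "The value should not be empty."
)

for lineno, word in unemphasized_keywords(sample):
    print(f"line {lineno}: '{word}' is a candidate for RFC 2119 emphasis")
```

Every hit still needs review, since ordinary English uses of words like "may" or "required" will be flagged too.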
DFDL Information Set
- I'm sure this has already been discussed at length, but I wonder if it
would be possible to define the DFDL Information Set as an extension to
the XPath Data Model (XDM) as XSLT does for its data model. This would
have many advantages. The XDM is compatible with both the XML Schema PSVI
and the XML Information Set (and the XDM standard explicitly explains the
conversion process to and from each). This therefore provides
interoperability with the alternate representations and uses of a DFDL
Schema as an XML Schema and as plain XML content. Additionally, an XDM (or
some reasonable facsimile) will have to be constructed from the DFDL
Schema anyway to support the XPath capabilities of DFDL - basing the DFDL
Infoset on XDM to begin with would ensure seamless (or at least easier)
use with XPath libraries and infrastructure. Also, XDM and DFDL both use
the XML Schema type system (after all, DFDL is a subset of XML Schema) and
as such XDM already supports the DFDL types.
- If the above is infeasible or too big of a change for this late in the
process, would it at least be possible to define the DFDL Information Set
in terms of the XML Information Set standard? The DFDL Information Set
already appears to be loosely based on it, and may actually be compatible
(I don't know) but the relationship is not explicit. Without such a
statement and the satisfaction of the requirements of extending the XML
Information Set as defined in that standard, implementations can't rely on
the compatibility. If the relationship was made explicit and we ensured
that the DFDL Information Set was indeed compatible with and extended the
XML Information Set, then the XDM needed to process DFDL expressions could
be generated using the Infoset to XDM process described in the XDM
standard. If we went this route, I would also make sure that we maintain
compatibility with the PSVI - that is, we don't want to introduce concepts
or information set members that conflict with the PSVI. This will make it
easier on implementers because they could potentially reuse the same
internal infoset representation for both the DFDL Information Set and the
PSVI during validation processes.
- I imagine it will take some investigation to determine if either of
these options is possible and compatible with DFDL concepts - I don't mind
taking on the task if there is interest in modifying the DFDL infoset. It
just seems a shame to me to forgo an opportunity to establish some synergy
with related XML standards.
- In any case, the concept of simple element information items and complex
element information items seems contrary to established convention. The
concept of using character information items (and groupings of them as
explicitly allowed in the XML Information Set standard) to represent child
simple content has already been established through the XML Information
Set standard and other related XML standards. It is especially confusing
given that the same terms as the XML Information Set are used.
- Should everything in the DFDL Information Set have a corresponding
representation in an XML document generated from or used to generate it
(not the DFDL Schema, but the result or input to parsing or unparsing)?
This question occurred to me based on the discussion in the most recent
teleconference about treating and representing comments as separate kinds
of content. It was suggested that the infoset would need to handle
comments in a special way so as to differentiate them from non-commented
content. My concern is that there may not be an appropriate XML
representation of such an infoset item. Creating a special element in the
result document would break the property that the result of DFDL
processing can be validated by the DFDL Schema (because the commented
element wouldn't have been declared in the original DFDL Schema - it
couldn't be a declared element because comments can appear anywhere in the
source content). The only other option I can see would be to treat source
content identified as comments and indicated as such in the infoset as XML
comments in the result document. This brings up the interesting
complication during unparsing of differentiating between "real" XML
comments (those that should truly be ignored) and "output" XML comments
(those that should be output to the result stream as commented content).
The solution might be to treat all XML comments in a document used for
unparsing as available for output as commented content, but it seems
unreasonable to redefine XML comments in that way. This brings us back to
the original question: if there is no way to adequately describe commented
content in a resultant XML document, does everything in the infoset need a
representation in an XML result document? What are the implications to
upholding the ability to round-trip (if the resultant XML document doesn't
contain everything in the infoset, and everything in the infoset is needed
to fully describe the source content, then unparsing the resultant XML
document will not result in the original source content)?
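The round-trip concern about XML comments can be seen with an ordinary XML toolchain. The sketch below is an illustration only, using Python's standard library (the insert_comments option needs Python 3.8 or later) and a made-up record; it is not a DFDL processor. By default the parser discards comments, so any source content carried as an XML comment is lost unless the consumer explicitly opts in to keeping comments.

```python
import xml.etree.ElementTree as ET

# Hypothetical result document in which commented source content
# has been represented as an XML comment.
source = "<record><!-- commented source content --><field>42</field></record>"

# Default behaviour: comments are dropped while building the tree.
dropped = ET.tostring(ET.fromstring(source), encoding="unicode")
print(dropped)

# Opting in to comment preservation (Python 3.8+).
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
kept = ET.tostring(ET.fromstring(source, parser=parser), encoding="unicode")
print(kept)
```

Even with comments preserved, nothing in the tree distinguishes a "real" comment from one that stands for source content, which is the ambiguity raised above.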
Annotation Elements and Representation Properties
- There seem to be inconsistencies throughout the document, specifically
in the descriptions of annotation elements and representation properties.
This is to be expected in a document that's been under heavy revision over
such a long time span, but an effort will need to be made to scrub out all
inconsistencies before the final version. To this end, I've found creating
a table of all annotation elements and their properties helpful. I've
attached what I have so far. It has all annotation elements and their
attributes and a notional start for the representation properties. I
intend to complete it as I go and hopefully make sure everything matches
up in the process.
- I'm not sure I understand the value in having the specialized annotation
elements. From the DFDL user/developer perspective it seems more difficult
because they need to recognize additional syntax. For example, when they
see a dfdl:choice annotation element they need to understand that it's
really a dfdl:format with a subset of allowed representation properties
appropriate to xs:choice elements. They still must refer to the standard
document to find out which representation properties are allowed, and the
alternate syntax doesn't necessarily help in validation because a standard
dfdl:format annotation element would also have been valid (and the DFDL
XML Schema can't determine which representation properties are valid on a
dfdl:format based only on usage location). It also makes the document more
confusing because representation properties are referred to as being valid
for specific dfdl:* annotation elements as opposed to the real meaning
which is that they're valid for dfdl:format elements that annotate
specific XML elements. To put it another way, a representation property
that is valid for dfdl:choice is also only valid for dfdl:format when used
as an annotation of xs:choice or as a short form property on xs:choice
elements - but this isn't necessarily clear from the property descriptions
since they only refer to the dfdl:* special annotation elements. From an
implementation perspective, it adds complexity because the extra element
names must be accepted. The DFDL parser will still have to validate
representation properties and their validity as applied to the parent
schema element regardless of whether the annotation element is a
dfdl:format or a special annotation element. Not to mention, wouldn't
short form be used most frequently anyway, in which case there are no
annotation elements? In any case, I see very little value for a
disproportionate amount of added complexity and potential confusion and I
suggest the concept of special annotation properties that restrict
dfdl:format be removed.
- The standard isn't totally clear and unambiguous on the behavior with
respect to the dfdl:format selector property. It is mentioned that the
selector is externally identified, but no additional information is given.
Are the selectors implementation specific? If so, does that break
compatibility with alternate DFDL parsers if the selector property is
used? What if a selector is referred to but doesn't exist in the parser?
Is it a schema definition error or a parsing error (when are external
selectors resolved)? What if there is no "default" dfdl:format block and
they all contain non-matching selectors? Is that a processing error
(should be explicit)?
- When a defined format is put into use, how/when are the representation
properties checked for validity with respect to the schema element (such
as xs:choice) that put the defined format into use? To put it another way,
is it an error (and what kind) if a defined format specifies
representation properties that aren't valid for the schema element that
uses it? Can a defined format contain the special format annotation
elements (dfdl:sequence, dfdl:choice, etc.)?
- The standard says that a dfdl:defineFormat can contain any of the other
annotation elements. How are the other annotation elements contained
within a dfdl:defineFormat (such as dfdl:assert or dfdl:hidden) applied
when a named format definition is referenced by a dfdl:format ref
attribute? Can named format definitions be referenced anywhere else other
than where a dfdl:format is expected? Do all other annotation elements
make sense, and are they valid, wherever a defined format would be referenced? If
not, suggest explicitly stating what annotation elements are allowed
within a dfdl:defineFormat as opposed to saying any are allowed.
- The descriptions for dfdl:assert and dfdl:discriminator read very
similarly (and probably for good reason) but it's not clear how they're
different. If the failure of a dfdl:discriminator results in a processing
error, doesn't that make it equivalent to an assert? In other words, how
can it be used for control when one of the two possible outcomes results
in an error that (potentially) halts processing? May want to refine the
description of dfdl:discriminator.
- What about the positioning of hidden elements relative to siblings? Can
they appear anywhere within the parent - in which case, is relative
position important? May want to address this one way or the other.
- The properties for dfdl:textNumberFormat are defined in the
representation property section. Granted, they may be representation
properties from the conceptual level, but syntactically, they would appear
to be different. Can the text number format properties be used in a
dfdl:format or dfdl:property element? If not, then suggest treating them
more as attributes of the dfdl:textNumberFormat element and defining them
there. If so, then I wonder what the purpose of the dfdl:textNumberFormat
element is - it would seem to fall into the same category as the other
special dfdl:format annotation elements that restrict the set of valid
representation properties. Also seems to apply to dfdl:defineEscapeScheme
and its representation properties.
- The document isn't clear on how the position of a variable declaration
impacts its scope. Does it apply to all children of the element to which
the definition belongs (regardless of position relative to the definition),
to all siblings following the definition (but not preceding), or to all
elements following the definition (regardless of hierarchy)? I assume the
first, but more clarification would be helpful in order to make it
unambiguous.
- The value type for representation properties is listed in conceptual
terms, but shouldn't all properties actually accept one or more specific
XML Schema (within the DFDL subset) types (usually atomic)? Making this
explicit would remove confusion on the part of implementers. For example,
several are defined as "Enum" - though the value may logically be an
enumerated type, the actual atomic type is something else like xs:string
or xs:token - with additional validation to ensure it's one of the allowed
enumerated values. The normative standard is first and foremost a
reference for implementations and as such should be totally unambiguous
with regard to typing information.
- There appear to be cases (such as alignment) where multiple types can be
accepted unnecessarily. In the alignment case, a specific xs:string or a
positive integer type is valid. Wouldn't it be easier on both the DFDL
XML Schema and the implementations if, wherever possible, only one atomic
type was accepted? In the alignment case, an xs:nonNegativeInteger could be
used where "0" means "implicit".
- With regard to case sensitivity, how is the case equivalency defined for
different character sets? Is this (or should it be) related to XPath
collations? Perhaps instead of an ignore case switch the user should be
allowed to specify a collation for initiator/terminator comparison and the
DFDL standard would require implementations include a case-insensitive
collation for common character sets. This would open the door to using
more general character/string comparison operations and could be important
in certain settings - for example, the XPath standard has an example in
which "v" and "w" are equivalent in Swedish. This may have some other
advantages: if collations are needed for this kind of thing, then we could probably
support fn:compare and fn:codepoint-equal in the DFDL XPath subset.
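To make the case-equivalence question concrete, here is a small sketch in Python. It is only an illustration: the function names are invented, and Python's built-in Unicode case folding stands in for whatever collation a DFDL implementation might use; it is not an XPath collation. It shows that two reasonable definitions of "ignore case" disagree on the same delimiter, which is why the comparison rule would need to be stated explicitly.

```python
def equal_by_lowercasing(data: str, delimiter: str) -> bool:
    """Naive ignore-case comparison: lowercase both sides."""
    return data.lower() == delimiter.lower()


def equal_by_case_folding(data: str, delimiter: str) -> bool:
    """Ignore-case comparison using Unicode full case folding."""
    return data.casefold() == delimiter.casefold()


# For plain ASCII delimiters both definitions agree.
print(equal_by_lowercasing("END", "end"))          # True
print(equal_by_case_folding("END", "end"))         # True

# For German sharp s they do not.
print(equal_by_lowercasing("STRASSE", "straße"))   # False
print(equal_by_case_folding("STRASSE", "straße"))  # True
```

A named collation, as suggested above, would let the schema author pick which of these behaviours (or a language-specific one) applies to initiator and terminator matching.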
If you've made it this far, congrats :) Hopefully this list will spur some
discussion.
Thanks,
Dave
---
David Glick | dglick at dracorp.com | 703.299.0700 x212
Data Research and Analysis Corp. | www.dracorp.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DFDL_TOC.doc
Type: application/octet-stream
Size: 47616 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20090225/7f36f91d/attachment-0002.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dfdlref.xls
Type: application/vnd.ms-excel
Size: 26624 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20090225/7f36f91d/attachment-0001.xls
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ogf-dfdl-v1.0-Core-033.final-dgcommented.doc
Type: application/octet-stream
Size: 2151424 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20090225/7f36f91d/attachment-0003.obj