[DFDL-WG] DFDL Data Model Recommendations

Wed May 13 04:23:12 CDT 2009

Hi Dave

Thanks for the write-up. A couple of comments before today's call:

First off I've remembered one of the reasons that we decided not to use 
the XDM directly. It was to do with the range of characters supported by 
XML versus DFDL. We need our infoset to be capable of holding x'00' for 
example.
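
As a quick illustration of the character-range point, here is a minimal
sketch. It uses Python's standard-library XML parser as a stand-in for any
XML 1.0 processor, and the helper name xml_accepts is made up for this
example. XML 1.0 excludes x'00' both as a literal character and as a
character reference, so an infoset tied to XML's character range could not
carry it.

```python
import xml.etree.ElementTree as ET


def xml_accepts(text: str) -> bool:
    """Return True if the XML parser accepts `text` as element content."""
    try:
        ET.fromstring(f"<value>{text}</value>")
        return True
    except (ET.ParseError, ValueError):
        # The parser refuses the document; NUL is outside XML's Char range.
        return False


if __name__ == "__main__":
    print(xml_accepts("A"))      # ordinary character
    print(xml_accepts("\x00"))   # literal NUL
    print(xml_accepts("&#0;"))   # NUL written as a character reference
```

The first call is expected to succeed and the other two to be rejected,
which is why a DFDL infoset that must hold x'00' cannot simply reuse XML's
character set.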

The next two statements indicate most clearly the problem I see with the 
proposal as worded:
- "The DFDL Data Model supports all of the nodes types in the XDM"  - no 
it does not, it only covers those that it actually needs.
- "All other XDM nodes may be provided by a given DFDL implementation " - 
no they should not.  They can be provided, sure, but not by a DFDL 
implementation.

I think you are falling into the same trap that we could have fallen into 
when adopting XSDL as the basis for DFDL.  The spec makes it very clear 
that a DFDL Schema uses an explicit subset of XML Schema, and that, for 
example, an XML Schema that contains attributes cannot be a valid DFDL 
Schema. Not even if the attributes are un-annotated. Any mapping of DFDL 
output to XML is a separate exercise.  The same must be true for the DFDL 
Infoset - it can use XDM as the basis but that's as far as the statement 
should go.  What we have to weigh up is whether taking a subset of XDM, 
plus extending it to handle x'00', unresolved choice nodes, etc, makes the 
DFDL Infoset easier for users to consume, or whether we are better off 
with a separate model. 

Overall, I think you are tailoring the DFDL infoset to your end goal of 
XML compatibility and I don't think that is the correct approach. 

I don't think we should discuss use of XML Infoset or PSVI at all.  Again, 
that is a transformation to XML and beyond the scope of the spec. If a 
user wants to see how that interops with XDM, he can use the XDM spec.

Regards

Steve Hanson
Programming Model Architect
WebSphere Message Brokers
Hursley, UK
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848

"Mike Beckerle" <mbeckerle.dfdl at gmail.com> 
12/05/2009 14:45
Please respond to
<mbeckerle.dfdl at gmail.com>

To
"'Dave Glick'" <dglick at dracorp.com>, Steve Hanson/UK/IBM at IBMGB
cc
<dfdl-wg at ogf.org>
Subject
RE: [DFDL-WG] DFDL Data Model Recommendations

Augmented infoset does not contain things like all branches of choice. 

The augmented infoset contains the hidden nodes; the regular infoset lacks
them. For example, a hidden element holds the number of occurrences of a
recurring element later in the bit stream. Logically this hidden element
isn't there. But expressions need to refer to its value, so it has to be in
"some" infoset, so we call that the augmented infoset. Another thing that 
will often be hidden are the elements used to discriminate arms of 
choices. These "tags" are almost never going to want to show through into 
the logical data model, as XML tags everything so explicitly, that having 
additional tags around is nothing but clumsy. These additional nodes in 
the augmented infoset are used both in parsing and unparsing.
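
A minimal sketch of the hidden-count example, under stated assumptions: the
data layout (one count byte followed by that many one-byte items) and the
names Node, parse and regular are invented for illustration and are not
taken from the spec. The parser keeps the count in the augmented infoset so
that expressions can refer to it, and the regular infoset is just the
augmented one with the hidden nodes filtered out.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    value: int
    hidden: bool = False  # hidden nodes exist only in the augmented infoset


def parse(data: bytes) -> list[Node]:
    """Parse a count byte followed by `count` one-byte items.

    Returns the augmented infoset, with the hidden count node included.
    """
    count = data[0]
    nodes = [Node("count", count, hidden=True)]
    nodes += [Node("item", b) for b in data[1:1 + count]]
    return nodes


def regular(augmented: list[Node]) -> list[Node]:
    """Drop hidden nodes to obtain the regular (logical) infoset."""
    return [n for n in augmented if not n.hidden]


if __name__ == "__main__":
    augmented = parse(bytes([3, 10, 20, 30]))
    print([n.name for n in augmented])
    print([n.value for n in regular(augmented)])
```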

Re: unparsing invalid models. 

We perhaps need another term, but the augmented infoset also gets default 
processing done on it. This allows an invalid infoset to be unparsed 
because certain logic will be used to make it valid. This logic is the 
filling in (augmenting the infoset) of default values (only!). E.g., if an 
element has a default value, or fixed value, and is required, but is not 
in the infoset, then it will be automatically added to make the resulting 
infoset valid. 

Similarly, if an element has an outputValueCalc property, then it is not 
expected to be found in the infoset. Rather, on unparsing, the calculation 
is performed and the infoset node created as an augmentation to the prior 
infoset.
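
The two augmentation steps just described (filling in defaults, then
computing outputValueCalc elements) might be sketched as follows. This is
illustrative only: the dictionary-based schema, the key names and the
function augment_for_unparse are assumptions made for the example, not DFDL
syntax or any implementation's API.

```python
def augment_for_unparse(infoset: dict, schema: list[dict]) -> dict:
    """Return a copy of `infoset` augmented for unparsing."""
    result = dict(infoset)

    # Step 1: add required elements that are missing but have a default
    # (or fixed) value.
    for decl in schema:
        name = decl["name"]
        if name not in result and decl.get("required") and "default" in decl:
            result[name] = decl["default"]

    # Step 2: elements with an output value calculation are not expected in
    # the incoming infoset; compute them and add them as augmentations.
    for decl in schema:
        if "output_value_calc" in decl:
            result[decl["name"]] = decl["output_value_calc"](result)

    return result


# Hypothetical schema: a calculated count, a defaulted version, and the items.
schema = [
    {"name": "count", "output_value_calc": lambda i: len(i["items"])},
    {"name": "version", "required": True, "default": 1},
    {"name": "items", "required": True},
]

if __name__ == "__main__":
    print(augment_for_unparse({"items": [10, 20, 30]}, schema))
```

Nothing else is repaired: an element that is required, missing and has no
default is left missing, in line with "default values (only!)" above.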

The values used when unparsing must be well formed for their corresponding 
types, and representable by those types, but they do not need to be valid 
in the sense of being within a numeric range constraint, or obeying a 
regular expression pattern as expressed on the XSD. 

So we do work to fill in defaultable values and thereby make an infoset 
"more valid" than it was, but we don't generally require XML-style 
validity.
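
To make the distinction concrete, a small sketch, assuming an integer
element with a hypothetical range constraint (the function name and the
tuple it returns are invented for this example). A value such as "150" is
well formed for the type even though it violates a 0..100 range, so it
could still be unparsed; "abc" is not well formed and could not.

```python
def check_for_unparse(text: str, minimum: int, maximum: int) -> tuple[bool, bool]:
    """Return (well_formed, valid) for `text` as a range-constrained integer.

    Only `well_formed` would gate unparsing; `valid` reflects the XSD-style
    range constraint, which unparsing does not require.
    """
    try:
        value = int(text)
    except ValueError:
        return (False, False)
    return (True, minimum <= value <= maximum)


if __name__ == "__main__":
    for sample in ("42", "150", "abc"):
        print(sample, check_for_unparse(sample, 0, 100))
```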

...mike

From: dfdl-wg-bounces at ogf.org [mailto:dfdl-wg-bounces at ogf.org] On Behalf 
Of Dave Glick
Sent: Monday, May 11, 2009 8:22 PM
To: Steve Hanson
Cc: dfdl-wg at ogf.org
Subject: [DFDL-WG] DFDL Data Model Recommendations

All,

I have attached my first draft of recommendations for the DFDL Data Model. 
It is based heavily on XDM and should be mostly compatible. Where there 
are incompatibilities, they are by convention and not related to the 
actual representation so I believe a system that is set up to work with 
XDM should be easily adaptable to work with DFDL. This was important to
me - in my project I would be using DFDL as part of a pipeline process that
also involved transforming the results of parsing using XSLT and then 
unparsing the results of transformation. I imagine many other potential 
users of DFDL will be engaging in similar XML pipeline scenarios. In these 
situations, the closer the DFDL model is to those used by other XML 
technologies, the better. Hopefully the recommendation meets the 
requirements Steve mentions below - it turns out most of the existing 
infoset could be mapped directly to XDM concepts. I introduced a new node 
type for the unresolvable concept to match the existing infoset, but 
discuss how it is related to XDM and other XML technologies.

This exercise brought two major questions to mind:

- Must an instance of the XML Infoset (and XML document) be valid against 
a given DFDL Schema (as determined by an XML Schema validation engine) to 
be available for unparsing? If not, is it up to the DFDL implementation to 
determine the suitability of the XML Infoset Character Information Items 
for their given unparsed data type? The real question is: for the 
unparsing process can a DFDL Data Model be constructed from an XML Infoset 
directly or only from a PSVI (where does the data for unparsing really 
come from)? It would seem to me that if the input XML isn't valid against 
the DFDL Schema, then it probably can't be unparsed - otherwise, how 
would the invalid portions be handled (such as strings that should be 
numeric or a structure that doesn't match)?

- I am confused by the notion of the "augmented infoset". The regular 
infoset appears to be based on the logical structure of the data 
post-parsing. In other words, choices are resolved and the result looks 
something like an XML Infoset, PSVI, or XDM tree might following something 
like XSLT transformation. The augmented infoset on the other hand appears 
to be based on the logical structure of the DFDL Schema being used for 
processing and therefore contains branches for all choice possibilities, 
etc. It is "filled in" as parsing takes place. This doesn't make a lot of 
sense to me - what about the branches for which there was no data to 
"fill-in" (such as choice branches that weren't followed)? Are they 
dropped following parsing? If not, then there are a lot of information 
items in the final tree that have no value. It made more sense to me to 
consider the DFDL Data Model as being constructed during parsing and at 
any given time in the parsing process a portion of the model (that which 
has already been parsed) is available.

Hopefully those questions made sense... I should (finally) be on the call 
this Wednesday to discuss.

Dave

From: Steve Hanson [mailto:smh at uk.ibm.com] 
Sent: Wednesday, May 06, 2009 11:15 AM
To: Dave Glick
Cc: Alan Powell; dfdl-wg at ogf.org
Subject: Re: [DFDL-WG] Agenda for OGF WG call 6 May 2009

Dave 

Two intents of the infoset were that it should be a) simple and b) easily 
related to the grammar in 11.3, so whatever you come up with needs to take 
those requirements into account. 

"Parts of XDM that have no relevance to DFDL but are also not conflicting 
should probably be left in for conciseness and compatibility." - a) above 
would imply the opposite. 

The XDM spec defines the rules for how an XDM can be created from an XML 
Infoset or a PSVI.  We can do a similar exercise for DFDL Infoset, for 
those users who want to use XSL for any post-DFDL transformation. 

Regards

Steve Hanson
Programming Model Architect
WebSphere Message Brokers
Hursley, UK
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848 

Dave Glick <dglick at dracorp.com> 
Sent by: dfdl-wg-bounces at ogf.org 
06/05/2009 13:55 

To
Alan Powell/UK/IBM at IBMGB, "dfdl-wg at ogf.org" <dfdl-wg at ogf.org> 
cc

Subject
Re: [DFDL-WG] Agenda for OGF WG call 6 May 2009

All, 

My apologies, but I will be unable to make the call again this week. I was 
hoping to have some suggestions regarding the infoset/data model for 
discussion today, but it's not quite ready (I still have a little more 
digging to do through the rest of the spec to make sure what I'm 
suggesting can adequately capture all the representation cases in DFDL). 
I'll try to get something out by the end of the day for review and 
discussion on next week's call. 

In general, it appears to me (and I'm admittedly not as versed in the 
various XML standards as the other members of the group) that we can bring 
the DFDL Infoset very closely in line with the XDM. Specifically, I've 
been looking at the way XSLT 2.0 treats XDM as its data model. It states 
clearly that XDM is the model for XSLT with certain explicit caveats and 
additions. This follows the XDM guidance of how it should be used by other 
standards (specifically in XDM Section 7 and Appendix A). The task for 
DFDL therefore consists of two parts: what parts of the XDM are in 
conflict with DFDL and should be explicitly excluded, and what parts of 
DFDL have no corresponding support in XDM and should be appended. Parts of 
XDM that have no relevance to DFDL but are also not conflicting should 
probably be left in for conciseness and compatibility. 

My biggest concern is over the use of two different types of Element 
Information Items in the DFDL specification as this seems so contrary to 
convention in XDM. My recommendations include treating all element nodes 
similarly to XDM as complex and those element nodes that actually only 
contain simple content should have a single child of the XDM text node 
type or a new DFDL value node type (not sure the best way to go here). 

In any case, I'll pass along a full recommendation soon. 

Dave 

From: dfdl-wg-bounces at ogf.org [dfdl-wg-bounces at ogf.org] On Behalf Of Alan 
Powell [alan_powell at uk.ibm.com]
Sent: Wednesday, May 06, 2009 6:01 AM
To: dfdl-wg at ogf.org
Subject: [DFDL-WG] Agenda for OGF WG call 6 May 2009

Agenda: 

1. Go through actions. 

2. LengthKind on Sequences and choices. 

LengthKind on sequences and choices and their parent element has proved 
confusing to new users of DFDL. It is proposed that lengthKind is removed 
from groups and only allowed to be set on the parent element. See email from 
SH 

3. Discuss UnorderedInitiated email from SH 

4. Infoset codepage and encoding 

The spec does not say what codepage and encoding are used for string 
fields. 

5. AOB 
Next version (034) 

Current Actions: 

No
Action 
012
AP/SH: Update decimalCalendarScheme 
10/9: Not allocated yet 
17/9: No update 
24/9: Add calendar binary formats to actions 
22/10: No progress 
16/1: proposal distributed and discussed. Will be redistributed 
21/1: add locale, 
04/02: changed from locale to specific properties 
18/2: Need more investigation of ICU strict/lax behaviour. 
08/04: Not discussed 
22/04: AP to complete asap once the ICU strict/lax behaviour is 
understood. 
29/04: No progress 
020
SH: Resolve packedDecimalSignCodes behaviour depends on NumberCheckPolicy 
22/10: No progress 
10/12: added how to decide to overpunch and sign position 
11/02: proposal largely agreed. SH to make minor changes 
18/02: AP to document unsigned type behaviour 
25/02: no progress 
08/04: Not discussed 
22/04: SH to complete last remaining issue, which is the behaviour when 
logical type is signed/unsigned and the physical type is unsigned/signed. 
29/04: SH had identified a problem with definition values and types in the 
infoset and will email proposal.  DG to be asked to accelerate action 032 
to see if it helps 
024
<No owner> String XML type 
08/04: Not discussed 
22/04: Need to allocate owner. Work is to describe the semantics of using 
dfdl:representation="xml" to model a well-formed XML fragment in an 
overall non-XML document described by a DFDL schema. 
29/04: As no resource is available to progress this action, agreed to defer 
from V1. Will close next week if no objections 

026
SH: Envelopes and Payloads 
08/04: Not discussed explicitly, but recursive use of DFDL is tied up with 
this 
22/04: Two aspects. Firstly compositional - do sufficient mechanisms exist 
to model an envelope with a payload that varies. Secondly markup syntax - 
this might be defined in the envelope. 
The second of these is very much tied up with the variable markup action 
028, so will be considered there. SH to verify the composition aspect. 
29/04: SH and AP working on proposal. Related to action 028 
027
SH: Property precedence tables 
08/04: Not discussed 
22/04: Two things missing from the existing precedence trees. Firstly, 
does not show alternates (eg, initiator v initiatorkind). Secondly, need a 
tree per concrete DFDL object (eg, element). SH to update. 
29/04: No progress 
028
SH: Variable markup 
08/04: Discussed briefly at end of call, IBM to see whether there are any use 
cases that require recursive use of DFDL. 
15/04: Use case was distributed and will be discussed on next call. 
22/04: The use case in question is EDI where the terminating markup for 
the payload segments is defined in the ISA envelope segment. The markup is 
modelled as an element of simple type where the allowable markup values 
are defined as enums on the type. But we need to handle two cases - 
firstly where the envelope is present, so the value used by the payload is 
taken from the envelope. Secondly where only the payload is present. Here 
we need a way of scanning for all the enum values, and adopting the one we 
actually find, when parsing. And using a default when unparsing. SH to 
explore use of a DFDL variable, where the variable has a default, but also 
has a type that is the same as the markup element - that way we get to use 
the enums without defining everything twice. 
29/04: SH and AP working on proposal. 
029
MB: valueCalc (output length calculation) 
08/04: Not discussed 
22/04: Action allocated to MB, this is to complete the work started at the 
Hursley WG F2F meeting. 
29/04: No progress 
032
DG: Investigate compatibility between DFDL infoset and XDM 
08/04: No update 
22/04: No update 
29/04: No update 
033
AP/TK: Assert/Discriminator semantics. AP to document. TK to check uses of 
discriminator besides choice. 
08/04: In progress within IBM 
22/04: Waiting for TK to return from leave to complete. 
29/04: TK has sent examples showing the need for discriminators beyond choice. 
Agreed. MB to respond to TK 
036
SH: Provide use case for floating component in a sequence 
08/04: Raised 
15/04: Use case sent and discussed. SH to do further investigation 
22/04: IBM feedback from WTX team is that alternate suggested ways of 
modelling the EDI floating NTE segment have significant usability issues. 
The DFDL principle is that for a problem that can be expressed as 
two-layered, then two DFDL models are needed.  The EDI NTE segment does 
not fall into this though, as its use is on a per sequence basis. Ongoing. 

29/04: Agreed that this needs to be in V1. SH to make a proposal 
037
All: Approach for XML Schema 1.0 UPA checks. 
22/04: Several non-XML models, when expressed in their most obvious DFDL 
Schema form, would fail XML Schema 1.0 Unique Particle Attribution checks 
that police model ambiguity.  And even re-jigging the model sometimes 
fails to fix this. Note this is equally applicable to XML Schema 1.1 and 
1.0. While the DFDL parser/unparser can happily resolve the ambiguities, 
the issue is one of definition. If an XSD editor that implements UPA 
checks is used to create DFDL Schema, then errors will be flagged. DFDL 
may have to adopt the position that: 
a) DFDL parser/unparser will not implement some/all UPA checks (exact 
checks tbd) 
b) XML Schema editors that implement UPA checks will not be suitable for 
all DFDL models 
c) If DFDL annotations are removed, the resulting pure XSD will not always 
be valid (ie, the equivalent XML is ambiguous and can't be modelled by XML 
Schema 1.0) 
Ongoing in case another solution can be found. 
29/04: Will ask DG and S Gao for opinion before closing 
038
MB: Submit response to OMG RFI for non-XML standardization 
22/04: First step is for MB to mail the OGF Data Area chair to say that we 
want to submit 
29/04: MB has been in contact with OMG and will submit DFDL. 
039
SKK: Approach for creating Schema-For-DFDL xsds. 
22/04: Resolve issue around multiple declarations needed for DFDL 
properties, perhaps using MB's meta approach 
29/04: Don't like qualified attributes in long form. SKK to check there 
are no code gen implications, eg EMF.
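
For the payload-only case in action 028 above (scan for all the enum
values, adopt the one actually found when parsing, and use a default when
unparsing), a minimal sketch follows. The function name, the candidate list
and the sample payloads are hypothetical, and "adopt the earliest match" is
just one possible reading of the scanning rule, not an agreed design.

```python
def detect_terminator(payload: str, candidates: list[str], default: str) -> str:
    """Adopt whichever candidate markup value occurs earliest in `payload`.

    Falls back to `default` when no candidate is found, mirroring the use of
    a default value when unparsing.
    """
    hits = [(payload.find(c), c) for c in candidates if c in payload]
    return min(hits)[1] if hits else default


if __name__ == "__main__":
    enums = ["~", "'", "\n"]  # allowable terminators, as enums on the type
    print(repr(detect_terminator("NTE*1*hello~NTE*2*bye~", enums, "~")))
    print(repr(detect_terminator("NTE+1+hello'", enums, "~")))
    print(repr(detect_terminator("no terminator here", enums, "~")))
```

When the envelope is present, this scan would be skipped and the value
taken from the envelope element instead.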

Alan Powell

MP 211, IBM UK Labs, Hursley,  Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM     email: alan_powell at uk.ibm.com 
Tel: +44 (0)1962 815073                  Fax: +44 (0)1962 816898

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU 

--
dfdl-wg mailing list
dfdl-wg at ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg 
