[DFDL-WG] Comments on draft 32 of DFDL spec

Tue Jun 10 07:14:37 CDT 2008

Rick 

Thanks for reviewing the DFDL document. I will get back to you with 
responses to your detailed comments.

In the meantime I have added answers to your questions below.

Alan Powell

 MP 211, IBM UK Labs, Hursley,  Winchester, SO21 2JN, England
 Notes Id: Alan Powell/UK/IBM     email: alan_powell at uk.ibm.com 
 Tel: +44 (0)1962 815073                  Fax: +44 (0)1962 816898

From:
"RPost" <rp0428 at pacbell.net>
To:
<dfdl-wg at ogf.org>
Date:
07/06/2008 23:42
Subject:
[DFDL-WG] Comments on draft 32 of DFDL spec

Hi, 

I have been performing ETL since the early '80s when there were over 20 
different floppy disk formats
and we had to write products like Uniform to copy data from one format to 
another.

At MicroPro Int'l (of WordStar fame) I was also involved in rewriting 
domestic versions of software
to support Shift-JIS for display and input for the Japanese market.

More recently I have written and supported ETL software for the 
telecommunications sector
(using ATIS XML formats) and for banking, which uses Automated Clearing 
House (ACH) XML standards.
ACH uses a lot of files with a format: FileHeader, (BatchHeader, Detail+, 
BatchTrailer)+, FileTrailer.
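
To make the layout above concrete, here is a minimal Python sketch that checks 
whether a sequence of record types fits it. The one-letter codes and the 
function name are assumptions made up for illustration. They are not taken 
from the ACH standard or from DFDL.

```python
import re

# Hypothetical one-letter codes for each record type (illustration only).
CODES = {
    "FileHeader": "F",
    "BatchHeader": "B",
    "Detail": "D",
    "BatchTrailer": "T",
    "FileTrailer": "Z",
}

# FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer
LAYOUT = re.compile(r"F(BD+T)+Z")


def matches_layout(record_types):
    """Return True if the sequence of record type names fits the layout."""
    try:
        encoded = "".join(CODES[name] for name in record_types)
    except KeyError:
        return False  # unknown record type
    return LAYOUT.fullmatch(encoded) is not None
```

A file with one batch holding two detail records fits the layout, while a 
batch with no detail records does not.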

As you can imagine I have seen a lot of duplication of effort due to the 
lack of a standard way to
define even the simplest of data formats, let alone the complex ones.

Hence my interest in DFDL, which started with Defuddle.

I have started to read the recent archives to get a sense of where the 
DFDL project stands
now compared to where it was in 2003-2005 when the great DFDL ice-age 
began and everyone's 
projects (Defuddle, Virtual XML) froze in their tracks.

Also, I have reviewed the recent Core-032.2 document, added comments to it 
and have a few
general questions about the current state of all things DFDL.

1. Re Parsing (input) only - How complete is the current draft spec in 
terms of being able
to create schemas and a parser for reading binary files? Would it support 
the most common delimited
and header/detail/trailer types of files?

From my limited exposure to the drafts and emails, and my early use of 
Defuddle, it certainly seems like
the parsing part is nearly complete and ready to have implementations 
created.

<< AWP >> We believe that the current spec is able to deal with most 
common commercial formats.

2. Will a conforming DFDL processor be required to support both parsing 
and unparsing?

I have only needed the parsing direction for most of the ETL work I have 
been involved with in
the last several years. Each company I worked for had their own custom file 
readers and parsers.
Thus I am very interested in having a product like Defuddle that can 
read/parse the basics.

<< AWP >> Good question. We have been assuming that both parsing and 
unparsing would be required.

3. Can someone clarify the extent, if any, to which DFDL is expected to be 
used to validate data
content as opposed to data structure? This isn't at all clear anywhere in 
the spec that I could find.
I added a comment suggesting a statement in either the 'What DFDL is' or 
the 'What DFDL is not'
section about this.

My assumption is that if a DFDL schema is used to unparse an infoset the 
only guarantee is that the
resulting physical structure will be correct and not the logical 
structure.

<< AWP >> We have distinguished between parsing/unparsing and validation 
and assumed that validation can be turned off. The validation that is 
performed is that which is definable using schema constructs such as 
enumerations, min/max occurs, min/max length, etc. More complex 
validation, such as cross-field validation, is outside the scope of DFDL and 
would require something such as Schematron to validate the infoset.
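
As an illustration of the kind of single-field validation described above, 
here is a minimal Python sketch with a switch to turn validation off. The 
FieldRule class, its attribute names, and check_field are hypothetical. They 
only loosely mirror schema facets and are not part of DFDL or of any real 
processor.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class FieldRule:
    """Hypothetical per-field constraints, loosely modelled on schema facets."""
    min_length: int = 0
    max_length: Optional[int] = None
    enumeration: Optional[Sequence[str]] = None


def check_field(value: str, rule: FieldRule, validate: bool = True) -> List[str]:
    """Return violation messages; always empty when validation is off."""
    if not validate:
        return []
    problems = []
    if len(value) < rule.min_length:
        problems.append("shorter than min_length")
    if rule.max_length is not None and len(value) > rule.max_length:
        problems.append("longer than max_length")
    if rule.enumeration is not None and value not in rule.enumeration:
        problems.append("not in enumeration")
    return problems
```

Each check looks at one value in isolation. A rule that compares two fields 
has no place in this structure, which is the boundary drawn above.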

Consider an example of a file with records having two date fields: 
start_date, end_date.

There are two input (parsing) operations that are potentially useful:

1. Physical - Read the date values into internal elements (or XML 
elements) and validate the
presence/absence/nullable state of each.

2. Business/Logical - Validate that the end_date is either null, or if not 
null is greater than
or equal to the start_date.
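
The two operations can be sketched separately in Python. Everything here is an 
assumption made for illustration: the 16-character fixed-width record, the 
YYYYMMDD format, blanks meaning null, and the function names. Treating a 
non-null end_date with a null start_date as a failure is also an assumption, 
since the rule as stated does not cover that case.

```python
from datetime import date, datetime
from typing import Optional


def parse_date(text: str) -> Optional[date]:
    """Physical step: an all-blank field is null, otherwise it must be YYYYMMDD."""
    text = text.strip()
    if not text:
        return None
    return datetime.strptime(text, "%Y%m%d").date()


def parse_record(line: str) -> dict:
    """Physical step: split a 16-character record into its two date fields."""
    return {
        "start_date": parse_date(line[0:8]),
        "end_date": parse_date(line[8:16]),
    }


def end_not_before_start(record: dict) -> bool:
    """Business step: end_date is null, or is on or after start_date."""
    start, end = record["start_date"], record["end_date"]
    if end is None:
        return True
    # Assumption: a non-null end_date with a null start_date fails the rule.
    return start is not None and end >= start
```

A record whose end_date precedes its start_date parses without error in the 
physical step and is only rejected by the separate business check, which is 
how 'dirty' data can pass operation 1 and still fail operation 2.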

Naturally DFDL will support #1 but does it support #2? I wouldn't think 
so. The ETL work I do might
not even know what the business rule is and even when we do we always have 
to deal with 'dirty' data.

There are two corresponding output (unparsing) operations of interest:

1. Physical - Write the date values in the proper external physical 
format.

2. Business/Logical - Validate that the date values in the infoset that 
are to be written meet
the business rule stated above.

Again, I would expect DFDL to support #1 but not #2.

I suggest that some comment or description be added to the spec to make 
clear the extent to which
DFDL supports the business/logical aspect of the data.

My concern is that users will be misled into thinking they can arbitrarily 
populate an infoset and
then, using a DFDL schema, create an external file that can be properly 
used by a native application.

It's one thing to 'roundtrip' data that is sourced from a native 
application and quite another to
produce valid native files using a non-native application.

Unless, of course, you have ideas about branching out to BRDL - Business 
Rule Definition Language?

Rick Post
 [attachment "ogf-dfdl-v1.0-Core-032.2_rpost.doc" deleted by Alan 
Powell/UK/IBM] --
  dfdl-wg mailing list
  dfdl-wg at ogf.org
  http://www.ogf.org/mailman/listinfo/dfdl-wg

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
