[DFDL-WG] Comments on draft 32 of DFDL spec

RPost rp0428 at pacbell.net
Sat Jun 7 17:30:46 CDT 2008


Hi, 

 

I have been performing ETL since the early '80s when there were over 20
different floppy disk formats

and we had to write products like Uniform to copy data from one format to
another.

 

At MicroPro Int'l (of WordStar fame) I was also involved in rewriting
domestic versions of software

to support Shift-JIS for display and input for the Japanese market.

 

More recently I have written and supported ETL software for the
telecommunications sector,

(using ATIS XML formats) and for banking, which uses Automated Clearing House
(ACH) XML standards.

ACH uses a lot of files with a format: FileHeader, (BatchHeader, Detail+,
BatchTrailer)+, FileTrailer.
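
As a rough illustration of that layout, here is a minimal Python sketch that checks only the ordering of records. The one-letter record codes and the function name are assumptions made up for this example; they are not the actual ACH record type codes.

```python
import re

# Hypothetical one-letter record codes (assumptions, not real ACH codes):
#   F = FileHeader, B = BatchHeader, D = Detail,
#   T = BatchTrailer, Z = FileTrailer
LAYOUT = re.compile(r"F(BD+T)+Z")


def has_valid_layout(record_codes):
    """Return True if the record codes follow
    FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer."""
    return LAYOUT.fullmatch("".join(record_codes)) is not None
```

For example, `has_valid_layout(["F", "B", "D", "D", "T", "Z"])` is true, while a batch with no detail records (`["F", "B", "T", "Z"]`) is rejected.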

 

As you can imagine I have seen a lot of duplication of effort due to the
lack of a standard way to

define even the simplest of data formats, let alone the complex ones.

 

Hence my interest in DFDL, which started with Defuddle.

 

I have started to read the recent archives to get a sense of where the DFDL
project stands

now compared to where it was in 2003-2005 when the great DFDL ice-age began
and everyone's 

projects (Defuddle, Virtual XML) froze in their tracks.

 

Also, I have reviewed the recent Core-032.2 document, added comments to it
and have a few

general questions about the current state of all things DFDL.

 

1. Re Parsing (input) only - How complete is the current draft spec in terms
of being able

to create schemas and a parser for reading binary files? Would it support
the most common delimited

and header/detail/trailer types of files?

 

From my limited exposure to the drafts and emails, and my early use of
Defuddle, it certainly seems like

the parsing part is nearly complete and ready to have implementations
created.

 

2. Will a conforming DFDL processor be required to support both parsing and
unparsing?

 

I have only needed the parsing direction for most of the ETL work I have
been involved with in

the last several years. Each company I worked for had their own custom file
readers and parsers.

Thus I am very interested in having a product like Defuddle that can
read/parse the basics.

 

3. Can someone clarify the extent, if any, to which DFDL is expected to be
used to validate data

content as opposed to data structure? This isn't at all clear anywhere in
the spec that I could find.

I added a comment suggesting a statement in either the 'What DFDL is' or the
'What DFDL is not'

section about this.

 

My assumption is that if a DFDL schema is used to unparse an infoset the
only guarantee is that the

resulting physical structure will be correct and not the logical structure.

 

Consider an example of a file with records having two date fields:
start_date, end_date.

 

There are two input (parsing) operations that are potentially useful:

 

1. Physical - Read the date values into internal elements (or XML elements)
and validate the

presence/absence/nullable state of each.

 

2. Business/Logical - Validate that the end_date is either null, or if not
null is greater than

or equal to the start_date.

 

Naturally DFDL will support #1 but does it support #2? I wouldn't think so.
The ETL work I do might

not even know what the business rule is and even when we do we always have
to deal with 'dirty' data.
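
To make the two input operations concrete, here is a minimal Python sketch. The YYYY-MM-DD text format, the empty-string convention for null, and both function names are assumptions for illustration only; nothing here is taken from the DFDL draft.

```python
from datetime import date, datetime


def parse_date(text):
    """Physical check: empty text means null, anything else must be YYYY-MM-DD.

    Raises ValueError when the text is present but malformed.
    """
    text = text.strip()
    if not text:
        return None
    return datetime.strptime(text, "%Y-%m-%d").date()


def dates_consistent(start_date, end_date):
    """Business/logical check: end_date is null, or is on or after start_date.

    Assumes start_date itself is not null.
    """
    return end_date is None or end_date >= start_date
```

A record with start_date 2008-06-07 and end_date 2008-01-01 passes `parse_date` for both fields but fails `dates_consistent`, which is exactly the kind of 'dirty' data that is structurally fine and logically wrong.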

 

There are two corresponding output (unparsing) operations of interest:

 

1. Physical - Write the date values in the proper external physical format.

 

2. Business/Logical - Validate that the date values in the infoset that are
to be written meet

the business rule stated above.

 

Again, I would expect DFDL to support #1 but not #2.
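
A small Python sketch of the output side shows why the two are separate. The fixed-width YYYYMMDD layout, the use of eight spaces for a null date, and the function name are all assumptions for this example, not anything defined by DFDL.

```python
from datetime import date


def unparse_record(start_date, end_date):
    """Write both dates as fixed-width YYYYMMDD fields; null becomes 8 spaces."""
    def field(value):
        return value.strftime("%Y%m%d") if value is not None else " " * 8
    return field(start_date) + field(end_date)


# The output is physically well formed even though end_date precedes
# start_date, so the business rule is violated without any error.
line = unparse_record(date(2008, 6, 7), date(2008, 1, 1))
```

The writer happily produces a 16-character record here; nothing in the physical format can tell that the dates are in the wrong order.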

 

I suggest that some comment or description be added to the spec to make
clear the extent to which

DFDL supports the business/logical aspect of the data.

 

My concern is that users will be misled into thinking they can arbitrarily
populate an infoset and

then, using a DFDL schema, create an external file that can be properly used
by a native application.

 

It's one thing to 'roundtrip' data that is sourced from a native application
and quite another to

produce valid native files using a non-native application.

 

Unless, of course, you have ideas about branching out to BRDL - Business Rule
Definition Language?

 

Rick Post

 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ogf-dfdl-v1.0-Core-032.2_rpost.doc
Type: application/msword
Size: 2533888 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20080607/f11ba2af/attachment-0001.doc 