[DFDL-WG] Comments on draft 32 of DFDL spec
Alan Powell
alan_powell at uk.ibm.com
Tue Jun 10 07:14:37 CDT 2008
Rick
Thanks for reviewing the DFDL document. I will get back to you with
responses to your detailed comments.
In the meantime I have added answers to your questions below.
Alan Powell
MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM email: alan_powell at uk.ibm.com
Tel: +44 (0)1962 815073 Fax: +44 (0)1962 816898
From: "RPost" <rp0428 at pacbell.net>
To: <dfdl-wg at ogf.org>
Date: 07/06/2008 23:42
Subject: [DFDL-WG] Comments on draft 32 of DFDL spec
Hi,
I have been performing ETL since the early '80s when there were over 20
different floppy disk formats
and we had to write products like Uniform to copy data from one format to
another.
At MicroPro Int'l (of WordStar fame) I was also involved in rewriting
domestic versions of software
to support Shift-JIS for display and input for the Japanese market.
More recently I have written and supported ETL software for the
telecommunications sector
(using ATIS XML formats) and for banking, which uses Automated Clearing
House (ACH) XML standards.
ACH uses a lot of files with a format: FileHeader, (BatchHeader, Detail+,
BatchTrailer)+, FileTrailer.
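As a rough illustration of that layout, here is a minimal Python sketch that checks whether a sequence of record types follows the FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer pattern. The one-letter codes and the function name are hypothetical, chosen only for this example; they are not real ACH record type codes and nothing here comes from the DFDL spec.

```python
import re

# Hypothetical one-letter codes for each record type (illustration only).
CODES = {
    "FileHeader": "H",
    "BatchHeader": "B",
    "Detail": "D",
    "BatchTrailer": "T",
    "FileTrailer": "F",
}

# FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer
LAYOUT = re.compile(r"H(BD+T)+F")


def has_valid_layout(record_types):
    """Return True if the list of record type names matches the layout."""
    try:
        encoded = "".join(CODES[name] for name in record_types)
    except KeyError:
        return False  # unknown record type name
    return LAYOUT.fullmatch(encoded) is not None
```

A file with one batch of two detail records passes; a batch with no detail records does not. This only checks record order, which is the kind of structural rule a format description would be expected to capture.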
As you can imagine I have seen a lot of duplication of effort due to the
lack of a standard way to
define even the simplest of data formats, let alone the complex ones.
Hence my interest in DFDL, which started with Defuddle.
I have started to read the recent archives to get a sense of where the
DFDL project stands
now compared to where it was in 2003-2005 when the great DFDL ice-age
began and everyone's
projects (Defuddle, Virtual XML) froze in their tracks.
Also, I have reviewed the recent Core-032.2 document, added comments to it
and have a few
general questions about the current state of all things DFDL.
1. Re Parsing (input) only - How complete is the current draft spec in
terms of being able
to create schemas and a parser for reading binary files? Would it support
the most common delimited
and header/detail/trailer types of files?
From my limited exposure to the drafts and emails, and my early use of
Defuddle, it certainly seems like
the parsing part is nearly complete and ready to have implementations
created.
<< AWP >> We believe that the current spec is able to deal with most
common commercial formats.
2. Will a conforming DFDL processor be required to support both parsing
and unparsing?
I have only needed the parsing direction for most of the ETL work I have
been involved with in
the last several years. Each company I worked for had their own custom file
readers and parsers.
Thus I am very interested in having a product like Defuddle that can
read/parse the basics.
<< AWP >> Good question. We have been assuming that both parsing and
unparsing would be required.
3. Can someone clarify the extent, if any, to which DFDL is expected to be
used to validate data
content as opposed to data structure? This isn't at all clear anywhere in
the spec that I could find.
I added a comment suggesting a statement in either the 'What DFDL is' or
the 'What DFDL is not'
section about this.
My assumption is that if a DFDL schema is used to unparse an infoset the
only guarantee is that the
resulting physical structure will be correct and not the logical
structure.
<< AWP >> We have distinguished between parsing/unparsing and validation
and assumed that validation can be turned off. The validation that is
performed is that which is definable using schema constructs such as
enumerations, min/max occurs, min/max length, etc. More complex
validation, such as cross-field validation, is outside the scope of DFDL
and would require something such as Schematron to validate the infoset.
Consider an example of a file with records having two date fields:
start_date, end_date.
There are two input (parsing) operations that are potentially useful:
1. Physical - Read the date values into internal elements (or XML
elements) and validate the
presence/absence/nullable state of each
2. Business/Logical - Validate that the end_date is either null, or if not
null is greater than
or equal to the start_date.
Naturally DFDL will support #1 but does it support #2? I wouldn't think
so. The ETL work I do might
not even know what the business rule is and even when we do we always have
to deal with 'dirty' data.
There are two corresponding output (unparsing) operations of interest:
1. Physical - Write the date values in the proper external physical
format.
2. Business/Logical - Validate that the date values in the infoset that
are to be written meet
the business rule stated above.
Again, I would expect DFDL to support #1 but not #2.
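To make the distinction concrete, here is a minimal Python sketch of the two layers for the start_date/end_date example. The fixed-width YYYYMMDD layout, the blank-means-null convention, and both function names are assumptions made up for this illustration; they do not come from the DFDL spec or from any real file format.

```python
from datetime import date, datetime
from typing import Optional


def parse_date_field(text: str) -> Optional[date]:
    """Physical step: a blank field is treated as null, anything else
    must parse as YYYYMMDD (an assumed external format)."""
    text = text.strip()
    if not text:
        return None
    return datetime.strptime(text, "%Y%m%d").date()


def end_not_before_start(start: Optional[date], end: Optional[date]) -> bool:
    """Business rule: end_date is null, or is on or after start_date."""
    return end is None or (start is not None and end >= start)


# Assumed record layout: start_date then end_date, 8 characters each.
record = "2008060120080531"
start = parse_date_field(record[0:8])
end = parse_date_field(record[8:16])

# Both fields are physically well formed, yet the pair breaks the rule.
physically_ok = start is not None
logically_ok = end_not_before_start(start, end)
```

Here the record parses cleanly but fails the business rule, which is exactly the 'dirty' data case: the physical step can succeed or fail independently of the logical one.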
I suggest that some comment or description be added to the spec to make
clear the extent to which
DFDL supports the business/logical aspect of the data.
My concern is that users will be misled into thinking they can arbitrarily
populate an infoset and
then, using a DFDL schema, create an external file that can be properly
used by a native application.
It's one thing to 'roundtrip' data that is sourced from a native
application and quite another to
produce valid native files using a non-native application.
Unless, of course, you have ideas about branching out to BRDL - Business
Rule Definition Language?
Rick Post
[attachment "ogf-dfdl-v1.0-Core-032.2_rpost.doc" deleted by Alan
Powell/UK/IBM] --