[DFDL-WG] Comments on draft 32 of DFDL spec

JimMyers jimmyers at ncsa.uiuc.edu
Sun Jun 8 12:37:22 CDT 2008


Rick,

A quick update from NCSA on Defuddle: after we got things started at PNNL and Tara Talbott created the initial version, we did indeed have an 'ice age' with no direct funding for it. Last year we received a small amount of funding at NCSA (where I moved to) from NARA to start Defuddle moving again and to incorporate it into the EU SHAMAN project's digital preservation architecture (lots of things in SHAMAN but the relevant idea here is the iRODS storage broker calling Defuddle to map things to logical models and the Multivalent Browser viewing the logical model).

Due in part to the lack of funding, we've dropped out of the DFDL discussion for a while. Probably the most significant places where Defuddle doesn't match the draft spec are that, I believe, the idea of layers has been dropped from the version 1 spec plan (which we think is critical), and that there has been a lot of work on the spec to deal with nilability, etc. (some of which I've argued can be avoided if you have layers).

In the next year+, we're rebuilding Defuddle on the latest libraries, doing a bunch of scalability and stress testing, targeting some common file formats (e.g. PNG) to show it works beyond the relatively simple scientific formats we've done to date, exploring uses in the digital library/preservation communities, and looking to extend from XML modeling to RDF/semantic modeling.

The last one of these starts to go in the direction of your questions #2 - if you get to RDF, you can start using OWL/rule constraints to assure that the data coming out/going in is semantically what you want (the DFDL group has discussed the backward direction, but at least for Defuddle it is not yet in our actual development plans). Our initial thought for doing this is to just include a GRDDL annotation in the DFDL file which tells you how to map (via XSLT, for example) from the XML created by the current Defuddle to the RDF you want - in essence going from binary to XML and from XML to an RDF logical model in two sequential steps. We haven't thought as deeply about semantic validation as the DFDL group has about XML-level validation (i.e. the discussion of errors that can be caught at the time of format definition, versus XML schema constraint violations on parsing, versus parsing errors themselves, etc.), but I think getting to RDF will allow a lot of what you're talking about with standard semantic web tools.
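
To make the two-step idea concrete, here is a rough sketch in Python. It is only an illustration under stated assumptions, not how Defuddle or DFDL actually works: the 8-byte record layout, the function names, and the "record:1" identifier are all made up for the example; plain (subject, predicate, object) tuples stand in for RDF; an ordinary function stands in for the XSLT that a GRDDL annotation would point to; and the rule checked is the start_date/end_date example from the quoted message below (ignoring the null case).

```python
import struct
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical physical layout: two dates, each stored big-endian as
# year (uint16), month (uint8), day (uint8). Not taken from any real format.
RECORD = struct.Struct(">HBBHBB")


def binary_to_xml(raw: bytes) -> ET.Element:
    """Step 1 (physical): decode the bytes into an XML element."""
    y1, m1, d1, y2, m2, d2 = RECORD.unpack(raw)
    rec = ET.Element("record")
    ET.SubElement(rec, "start_date").text = date(y1, m1, d1).isoformat()
    ET.SubElement(rec, "end_date").text = date(y2, m2, d2).isoformat()
    return rec


def xml_to_triples(rec: ET.Element, subject: str) -> list[tuple[str, str, str]]:
    """Step 2 (logical): map each child element to a triple.

    Plain tuples are an assumption standing in for a real RDF model.
    """
    return [(subject, child.tag, child.text) for child in rec]


def check_date_rule(triples: list[tuple[str, str, str]]) -> bool:
    """Semantic constraint on the logical model: end_date >= start_date."""
    values = {pred: date.fromisoformat(obj) for _, pred, obj in triples}
    return values["end_date"] >= values["start_date"]


raw = RECORD.pack(2008, 6, 1, 2008, 6, 8)
triples = xml_to_triples(binary_to_xml(raw), "record:1")
print(triples)
print(check_date_rule(triples))
```

The point of keeping the steps separate is that the physical description only has to get the bytes into XML correctly, while rules like the date ordering live entirely on the logical side, where standard semantic web tools could take over from the hand-written check used here.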

As we get going on Defuddle again, I hope to get reconnected with the DFDL effort - I've mostly been lurking on the list the past year+ - and see how we can help without being disruptive (perhaps looking at post 1.0 changes?).

In any case, I wanted to respond to your question and give a 'what's new' report back to the group since I've been quiet a while. For Defuddle, while I still think we need to grow, things are looking up with a very interested sponsor, international collaboration, and some momentum after the 'ice age'. (As always, I'd be very happy to talk with anyone who'd like to get involved in Defuddle, whether in software development or in creating format descriptions and using it, etc. Defuddle is open source and I know everyone involved to date would really like to see it become community (versus project) driven.)

Cheers,

 Jim

James D. Myers, Ph.D.
Associate Director, Cyberenvironments
National Center for Supercomputing Applications
University of Illinois at Urbana Champaign
1205 W Clark St.
Urbana, IL 61801
217-244-1934

----- "RPost" <rp0428 at pacbell.net> wrote:

> Hi,
> 
>  
> 
> I have been performing ETL since the early '80s when there were over
> 20 different floppy disk formats
> 
> and we had to write products like Uniform to copy data from one format
> to another.
> 
>  
> 
> At MicroPro Int'l (of WordStar fame) I was also involved in rewriting
> domestic versions of software
> 
> to support Shift-JIS for display and input for the Japanese market.
> 
>  
> 
> More recently I have written and supported ETL software for the
> telecommunications sector
> 
> (using ATIS XML formats) and for banking, which uses Automated Clearing
> House (ACH) XML standards.
> 
> ACH uses a lot of files with a format: FileHeader, (BatchHeader,
> Detail+, BatchTrailer)+, FileTrailer.
> 
>  
> 
> As you can imagine I have seen a lot of duplication of effort due to
> the lack of a standard way to
> 
> define even the simplest of data formats, let alone the complex ones.
> 
>  
> 
> Hence my interest in DFDL, which started with Defuddle.
> 
>  
> 
> I have started to read the recent archives to get a sense of where the
> DFDL project stands
> 
> now compared to where it was in 2003-2005 when the great DFDL ice-age
> began and everyone's
> 
> projects (Defuddle, Virtual XML) froze in their tracks.
> 
>  
> 
> Also, I have reviewed the recent Core-032.2 document, added comments
> to it and have a few
> 
> general questions about the current state of all things DFDL.
> 
>  
> 
> 1. Re Parsing (input) only - How complete is the current draft spec in
> terms of being able
> 
> to create schemas and a parser for reading binary files? Would it
> support the most common delimited
> 
> and header/detail/trailer types of files?
> 
>  
> 
> From my limited exposure to the drafts and emails, and my early use of
> Defuddle, it certainly seems like
> 
> the parsing part is nearly complete and ready to have implementations
> created.
> 
>  
> 
> 2. Will a conforming DFDL processor be required to support both
> parsing and unparsing?
> 
>  
> 
> I have only needed the parsing direction for most of the ETL work I
> have been involved with in
> 
> the last several years. Each company I worked for had their own custom
> file readers and parsers.
> 
> Thus I am very interested in having a product like Defuddle that can
> read/parse the basics.
> 
>  
> 
> 3. Can someone clarify the extent, if any, to which DFDL is expected
> to be used to validate data
> 
> content as opposed to data structure. This isn't at all clear anywhere
> in the spec that I could find.
> 
> I added a comment suggesting a statement in either the 'What DFDL is'
> or the 'What DFDL is not'
> 
> section about this.
> 
>  
> 
> My assumption is that if a DFDL schema is used to unparse an infoset
> the only guarantee is that the
> 
> resulting physical structure will be correct and not the logical
> structure.
> 
>  
> 
> Consider an example of a file with records having two date fields:
> start_date, end_date.
> 
>  
> 
> There are two input (parsing) operations that are potentially useful:
> 
>  
> 
> 1. Physical - Read the date values into internal elements (or XML
> elements) and validate the
> 
> presence/absence/nullable state of each
> 
>  
> 
> 2. Business/Logical - Validate that the end_date is either null, or if
> not null is greater than
> 
> or equal to the start_date.
> 
>  
> 
> Naturally DFDL will support #1 but does it support #2? I wouldn't
> think so. The ETL work I do might
> 
> not even know what the business rule is and even when we do we always
> have to deal with 'dirty' data.
> 
>  
> 
> There are two corresponding output (unparsing) operations of interest:
> 
>  
> 
> 1. Physical - Write the date values in the proper external physical
> format.
> 
>  
> 
> 2. Business/Logical - validate that the date values in the infoset
> that are to be written meet
> 
> the business rule stated above.
> 
>  
> 
> Again, I would expect DFDL to support #1 but not #2.
> 
>  
> 
> I suggest that some comment or description be added to the spec to
> make clear the extent to which
> 
> DFDL supports the business/logical aspect of the data.
> 
>  
> 
> My concern is that users will be misled into thinking they can
> arbitrarily populate an infoset and
> 
> then, using a DFDL schema, create an external file that can be
> properly used by a native application.
> 
>  
> 
> It's one thing to 'roundtrip' data that is sourced from a native
> application and quite another to
> 
> produce valid native files using a non-native application.
> 
>  
> 
> Unless, of course, you have ideas about branching out to BRDL -
> Business Rule Definition Language?
> 
>  
> 
> Rick Post
> 
>   
> --
>   dfdl-wg mailing list
>   dfdl-wg at ogf.org
>   http://www.ogf.org/mailman/listinfo/dfdl-wg

