[DFDL-WG] Comments on draft 32 of DFDL spec

Tue Jun 10 09:08:24 CDT 2008

Great to hear from you Jim.

So Defuddle is still alive! I hope we can push it towards agreement
with the current draft, and then as you suggested it can play a role
of supporting the needed experiments in layering.

I concur that much complexity in DFDL would be better as a library on
top of an extensible core, thereby allowing the standard to be
decomposed better. I hope Defuddle can help us figure that out. We've
dropped the extensibility and most of layering from v1.0 of DFDL only
for lack of examples from which to standardize. We have retained
little bits of it, e.g., hidden elements, and we recently added
something which might be called "generalized markup", where you can
specify a type name (restricted to simple type for now) as the
delimiter, and an instance of that type can be used as the initiator,
terminator, or separator of elements. This seems to be core. That is,
a speculative parser needs syntax to go after, otherwise it ends up
non-deterministic, but we can generalize what that syntax is, and so
get a great deal more generality. We did this to avoid the
proliferation of keywords for properties that is otherwise required
for every variation on separators, e.g., can I use a regexp to define
what a delimiter looks like? Yes, you set up a simple type which is a
string matching a regexp, and use that as the delimiter.
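
As a rough illustration of the regexp-as-delimiter idea, here is a minimal
Python sketch. It is not DFDL syntax and not how any DFDL processor is
implemented; the function name parse_sequence and the three patterns are
hypothetical stand-ins for "a simple type whose instances are the initiator,
separator, or terminator".

```python
import re


def parse_sequence(data, separator, initiator=None, terminator=None):
    """Split `data` into elements using regex-defined markup.

    Each of `separator`, `initiator`, and `terminator` is a regular
    expression standing in for a delimiter "type" (an assumption made
    for this sketch, not DFDL syntax).
    """
    start, end = 0, len(data)

    if initiator is not None:
        # The initiator must match at the very beginning of the data.
        m = re.match(initiator, data)
        if m is None:
            raise ValueError("initiator not found")
        start = m.end()

    if terminator is not None:
        # The terminator must match at the very end of the data.
        m = re.search(r"(?:%s)\Z" % terminator, data[start:])
        if m is None:
            raise ValueError("terminator not found")
        end = start + m.start()

    # Everything between initiator and terminator is split on the separator.
    return re.split(separator, data[start:end])


# Hypothetical delimiter "types", each defined only by a pattern.
SEPARATOR = r"\s*\|\s*"   # a bar with optional surrounding whitespace
INITIATOR = r"\[+"        # one or more opening brackets
TERMINATOR = r"\]+"       # one or more closing brackets

print(parse_sequence("[[a | b|c]]", SEPARATOR, INITIATOR, TERMINATOR))
```

The parser only knows that it must search for delimiters; what counts as a
delimiter is supplied entirely by the patterns.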

I have been interested for a while in something I call core DFDL which
is the smallest set of features from which one can bootstrap the rest.
Of course this requires the ability to define new properties and data
types. I think about this in terms of having only one built-in type:
dfdl:bit, the expression language.

This sounds appealing, but in thinking about it lots of time gets
spent worrying about how to synthesize character sets from bytes,
numbers from bytes, etc. rather than the thorny stuff we really need
extensibility for, which is the obscure delimiter and nullability
features and stuff like "finalTerminatorCanBeMissing" where the
descriptions in English are complicated and a bootstrap would be a
better characterization. I think the generalized markup described
above may prove to be core. I.e., a sequence of elements with
delimiters naturally implies a parser that has to search for
delimiters, but what those delimiters are can be quite general.
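
To make the "numbers from bytes, character sets from bytes" part of the
bootstrap concrete, here is a small Python sketch of deriving higher-level
values from nothing but a list of bits. The helper names are hypothetical,
and most-significant-bit-first ordering is an assumption of the sketch, not
something taken from the draft.

```python
def bits_to_uint(bits):
    """Fold a sequence of 0/1 values into an unsigned integer (MSB first)."""
    value = 0
    for bit in bits:
        if bit not in (0, 1):
            raise ValueError("bits must be 0 or 1")
        value = (value << 1) | bit
    return value


def bits_to_bytes(bits):
    """Group bits into 8-bit bytes; the length must be a multiple of 8."""
    if len(bits) % 8 != 0:
        raise ValueError("bit count is not a multiple of 8")
    return bytes(bits_to_uint(bits[i:i + 8]) for i in range(0, len(bits), 8))


def bits_to_text(bits, encoding="ascii"):
    """Decode bits as text by way of bytes and a named character set."""
    return bits_to_bytes(bits).decode(encoding)


# 0x48 0x69 spelled out bit by bit.
HI_BITS = [0, 1, 0, 0, 1, 0, 0, 0,
           0, 1, 1, 0, 1, 0, 0, 1]

print(bits_to_uint([1, 0, 1]))
print(bits_to_text(HI_BITS))
```

Each layer is defined only in terms of the one below it, which is the sense
in which the rest could be bootstrapped from a single bit type.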

...mikeb

On Jun 8, 2008, at 1:37 PM, JimMyers <jimmyers at ncsa.uiuc.edu> wrote:

> Rick,
>
> A quick update from NCSA on Defuddle: after we got things started at
> PNNL and Tara Talbott created the initial version, we did indeed
> have an 'ice age' with no direct funding for it. Last year we
> received a small amount of funding at NCSA (where I moved to) from
> NARA to start Defuddle moving again and to incorporate it into the
> EU SHAMAN project's digital preservation architecture (lots of
> things in SHAMAN but the relevant idea here is the iRODS storage
> broker calling Defuddle to map things to logical models and the
> Multivalent Browser viewing the logical model).
>
> Due in part to the lack of funding, we've dropped out of the DFDL
> discussion for a while. Probably the most significant issues where
> Defuddle doesn't match the draft spec are that I believe the idea of
> layers has been dropped from the version 1 spec plan (which we think
> is critical) and there has been a lot of work on the spec to deal
> with nilability, etc. (some of which I've argued can be avoided if
> you have layers).
>
> In the next year+, we're rebuilding Defuddle on the latest
> libraries, doing a bunch of scalability and stress testing,
> targeting some common file formats (e.g., PNG) to show it works
> beyond the relatively simple scientific formats we've done to date,
> exploring uses in the digital library/preservation communities and
> looking to extend from XML modeling to RDF/semantic modeling.
>
> The last one of these starts to go in the direction of your
> question #2 - if you get to RDF, you can start using OWL/rule
> constraints to assure that the data coming out/going in is
> semantically what you want (The DFDL group has discussed the
> backward direction, but at least for Defuddle it is not yet in our
> actual development plans). Our initial thought for doing this is to
> just include a GRDDL annotation in the DFDL file which tells you how
> to map (via XSLT for example) from the XML created by the current
> Defuddle to the RDF you want - in essence going from binary to XML
> and XML to RDF logical model in two sequential steps. We haven't
> thought as deeply about semantic validation as the DFDL group has
> about XML-level validation (i.e. the discussion of errors that can
> be caught at the time of format definition, versus XML schema
> constraint violations on parsing versus parsing errors themselves,
> etc.), but I think getting to RDF will allow a lot of what you're
> talking about with standard semantic web tools.
>
> As we get going on Defuddle again, I hope to get reconnected with
> the DFDL effort - I've mostly been lurking on the list the past year
> + - and see how we can help without being disruptive (perhaps
> looking at post 1.0 changes?).
>
> In any case, I wanted to respond to your question and give a 'what's
> new' report back to the group since I've been quiet a while. For
> Defuddle, while I still think we need to grow, things are looking up
> with a very interested sponsor, international collaboration, and
> some momentum after the 'ice age'. (As always, I'd be very happy to
> talk with anyone who'd like to get involved in Defuddle (software
> development or creating format descriptions and using it,  etc.) -
> Defuddle is open source and I know everyone involved to date would
> really like to see it become community (versus project) driven.)
>
> Cheers,
>
> Jim
>
> James D. Myers, Ph.D.
> Associate Director, Cyberenvironments
> National Center for Supercomputing Applications
> University of Illinois at Urbana Champaign
> 1205 W Clark St.
> Urbana, IL 61801
> 217-244-1934
>
> ----- "RPost" <rp0428 at pacbell.net> wrote:
>
>> Hi,
>>
>>
>>
>> I have been performing ETL since the early '80s when there were over
>> 20 different floppy disk formats
>>
>> and we had to write products like Uniform to copy data from one
>> format
>> to another.
>>
>>
>>
>> At MicroPro Int'l (of WordStar fame) I was also involved in rewriting
>> domestic versions of software
>>
>> to support Shift-JIS for display and input for the Japanese market.
>>
>>
>>
>> More recently I have written and supported ETL software for the
>> telecommunications sector
>>
>> (using ATIS XML formats) and for banking, which uses Automated Clearing
>> House (ACH) XML standards.
>>
>> ACH uses a lot of files with a format: FileHeader, (BatchHeader,
>> Detail+, BatchTrailer)+, FileTrailer.
>>
>>
>>
>> As you can imagine I have seen a lot of duplication of effort due to
>> the lack of a standard way to
>>
>> define even the simplest of data formats, let alone the complex ones.
>>
>>
>>
>> Hence my interest in DFDL, which started with Defuddle.
>>
>>
>>
>> I have started to read the recent archives to get a sense of where
>> the
>> DFDL project stands
>>
>> now compared to where it was in 2003-2005 when the great DFDL ice-age
>> began and everyone's
>>
>> projects (Defuddle, Virtual XML) froze in their tracks.
>>
>>
>>
>> Also, I have reviewed the recent Core-032.2 document, added comments
>> to it and have a few
>>
>> general questions about the current state of all things DFDL.
>>
>>
>>
>> 1. Re Parsing (input) only - How complete is the current draft spec
>> in
>> terms of being able
>>
>> to create schemas and a parser for reading binary files? Would it
>> support the most common delimited
>>
>> and header/detail/trailer types of files?
>>
>>
>>
>> From my limited exposure to the drafts and emails, and my early use
>> of
>> Defuddle, it certainly seems like
>>
>> the parsing part is nearly complete and ready to have implementations
>> created.
>>
>>
>>
>> 2. Will a conforming DFDL processor be required to support both
>> parsing and unparsing?
>>
>>
>>
>> I have only needed the parsing direction for most of the ETL work I
>> have been involved with in
>>
>> the last several years. Each company I worked for had their own custom
>> file readers and parsers.
>>
>> Thus I am very interested in having a product like Defuddle that can
>> read/parse the basics.
>>
>>
>>
>> 3. Can someone clarify the extent, if any, to which DFDL is expected
>> to be used to validate data
>>
>> content as opposed to data structure. This isn't at all clear
>> anywhere
>> in the spec that I could find.
>>
>> I added a comment suggesting a statement in either the 'What DFDL is'
>> or the 'What DFDL is not'
>>
>> section about this.
>>
>>
>>
>> My assumption is that if a DFDL schema is used to unparse an infoset
>> the only guarantee is that the
>>
>> resulting physical structure will be correct and not the logical
>> structure.
>>
>>
>>
>> Consider an example of a file with records having two date fields:
>> start_date, end_date.
>>
>>
>>
>> There are two input (parsing) operations that are potentially useful:
>>
>>
>>
>> 1. Physical - Read the date values into internal elements (or XML
>> elements) and validate the
>>
>> presence/absence/nullable state of each
>>
>>
>>
>> 2. Business/Logical - Validate that the end_date is either null, or
>> if
>> not null is greater than
>>
>> or equal to the start_date.
>>
>>
>>
>> Naturally DFDL will support #1 but does it support #2? I wouldn't
>> think so. The ETL work I do might
>>
>> not even know what the business rule is and even when we do we always
>> have to deal with 'dirty' data.
>>
>>
>>
>> There are two corresponding output (unparsing) operations of
>> interest:
>>
>>
>>
>> 1. Physical - Write the date values in the proper external physical
>> format.
>>
>>
>>
>> 2. Business/Logical - validate that the date values in the infoset
>> that are to be written meet
>>
>> the business rule stated above.
>>
>>
>>
>> Again, I would expect DFDL to support #1 but not #2.
>>
>>
>>
>> I suggest that some comment or description be added to the spec to
>> make clear the extent to which
>>
>> DFDL supports the business/logical aspect of the data.
>>
>>
>>
>> My concern is that users will be misled into thinking they can
>> arbitrarily populate an infoset and
>>
>> then, using a DFDL schema, create an external file that can be
>> properly used by a native application.
>>
>>
>>
>> It's one thing to 'roundtrip' data that is sourced from a native
>> application and quite another to
>>
>> produce valid native files using a non-native application.
>>
>>
>>
>> Unless, of course you have ideas about branching out to BRDL -
>> Business Rule Definition Language?
>>
>>
>>
>> Rick Post
>>
>>
>> --
>>  dfdl-wg mailing list
>>  dfdl-wg at ogf.org
>>  http://www.ogf.org/mailman/listinfo/dfdl-wg