[dfdl-wg] Defuddle Questions and Pull-Parsing Thoughts

Fri Jun 2 07:40:08 CDT 2006

Tom,
The short answer (if my understanding is correct) is that we started 
basically as you describe, but Tara has been implementing on-demand 
reading of data. Only the structure is read up front, to create empty 
parsing classes that get compiled. When you connect that structure to 
one or more data sources, nothing is read at first. When you ask for an 
element, Defuddle either reads or skips (by knowing or calculating its 
length) each element required to find that element's data (which may 
not simply be all elements above it in the schema) and returns the 
value. I don't think we have a mechanism to free the memory of an 
element once you are done with it, though I think we could add one.

We did at least think about StAX, but mostly at the beginning of 
creating Defuddle, when StAX was very new, so part of our decision to 
use JaxMe was its relative maturity. So I think we can do better with 
the current architecture than a straight fill-the-structure model. For 
truly large data, though, while we might be able to scale, starting 
directly from a streaming approach might be better, or at least more 
natural (presumably it fits the model of the surrounding program better).
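
As a rough illustration of the on-demand behaviour described above, here 
is a minimal Java sketch. It is not Defuddle code: the class LazyRecord, 
its methods, and the flat layout of fixed-length ASCII fields are all 
assumptions made up for the example. It also simplifies by skipping every 
field declared before the requested one, and it is forward-only.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of on-demand element access; these are not Defuddle's classes. */
final class LazyRecord {
    private final List<String> names = new ArrayList<>();
    private final List<Integer> lengths = new ArrayList<>();
    private DataInputStream in;   // stays null until connect() is called
    private int next = 0;         // index of the first field not yet consumed
    private int bytesRead = 0;
    private int bytesSkipped = 0;

    /** Declares structure only; no data is touched. */
    LazyRecord field(String name, int length) {
        names.add(name);
        lengths.add(length);
        return this;
    }

    /** Attaches a data source without reading from it. */
    void connect(InputStream source) {
        in = new DataInputStream(source);
    }

    /** Skips the fields before the requested one by their known lengths, then reads it. */
    String get(String name) {
        if (in == null) throw new IllegalStateException("no data source connected");
        int index = names.indexOf(name);
        if (index < 0) throw new IllegalArgumentException("unknown field: " + name);
        if (index < next) throw new IllegalStateException("already passed: " + name);
        try {
            for (; next < index; next++) {
                int remaining = lengths.get(next);
                bytesSkipped += remaining;
                while (remaining > 0) {          // skipBytes may skip fewer bytes than asked
                    int skipped = in.skipBytes(remaining);
                    if (skipped <= 0) throw new EOFException("data ended in " + names.get(next));
                    remaining -= skipped;
                }
            }
            byte[] buf = new byte[lengths.get(index)];
            in.readFully(buf);
            bytesRead += buf.length;
            next++;
            return new String(buf, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int bytesRead() { return bytesRead; }
    int bytesSkipped() { return bytesSkipped; }

    public static void main(String[] args) {
        LazyRecord record = new LazyRecord().field("id", 4).field("payload", 6).field("tag", 3);
        record.connect(new ByteArrayInputStream("0001ABCDEFxyz".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(record.get("tag"));       // prints: xyz
        System.out.println(record.bytesSkipped());   // prints: 10
    }
}
```

Nothing is read at connect() time, and asking for "tag" skips the ten 
bytes in front of it rather than building objects for them. The sketch 
keeps no values after returning them, which sidesteps the memory-freeing 
question rather than answering it.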

  Jim

At 06:54 AM 6/2/2006, Tom Sugden wrote:
>Hi all,
>
>Apologies for missing the telcon this week due to other work
>pressures. I haven't made much progress with any implementation, but
>have been taking a look at the Defuddle code. I have some questions
>that someone may be able to answer, and a few thoughts for discussion.
>
>The current Defuddle implementation is based upon JAXME, the Java/XML
>binding implementation. Presumably JAXME is used to generate an object
>model representation of the data format described by the DFDL schema.
>And then, I think the underlying data stream would be unmarshalled into
>an instance of that object model. Is this understanding correct?
>
>If my understanding is correct, I'm concerned that this approach may
>not be suitable for large data streams, since the entire object model
>instance would probably have to be assembled and stored in memory,
>like a DOM tree. Has anybody considered using a streamed pull-parsing
>approach instead, based upon or similar to StAX (Streaming API for
>XML)?
>
>I was thinking along the lines of parsing the DFDL schema into DOM or
>some other internal representation. Then pull-parsing the data stream,
>producing a sequence of StAX-like events corresponding to the data in
>the stream and its structure. During the pull-parsing, the parsing
>context would need to be maintained, and the appropriate conversion
>algorithm applied to transform parts of the data stream into values of
>the correct type. These values would then be wrapped in corresponding
>event objects.
>
>If this approach were viable, then these StAX-like APIs could be used
>to implement higher-level applications or APIs. For instance, it would
>be straightforward to produce an XML serialization of any data
>described by a DFDL schema. One could also imagine binding any data
>described by a DFDL schema to auto-generated Java beans, or to a DOM
>object, when desirable. The process may even be reversible, so that
>data could be written back to a data stream as well as being read from
>one.
>
>I haven't thought this through very deeply yet and my understanding
>of the issues is still quite naive, so I will be very interested to
>hear any comments. Sorry if this avenue has already been explored, or
>I've misunderstood the mechanics of Defuddle or JAXME.
>
>Cheers,
>Tom
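
The pull-parsing idea in the quoted message can be sketched roughly as 
follows. This is only an illustration under stated assumptions, not a 
DFDL, Defuddle, or StAX API: the names PullParserSketch, EventReader, 
FieldDef, and Event are hypothetical, and the "schema" is just a flat list 
of fixed-length fields (big-endian 32-bit integers and ASCII strings) 
standing in for a parsed DFDL schema. The reader keeps its context in a 
cursor, converts bytes to typed values, and wraps them in events. The 
toXml method shows one possible consumer.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical StAX-like pull parser over fixed-length binary records. */
final class PullParserSketch {
    enum EventType { START_RECORD, FIELD, END_RECORD }
    enum FieldType { INT32, ASCII }

    /** One schema entry; length is only used for ASCII fields. */
    static final class FieldDef {
        final String name; final FieldType type; final int length;
        FieldDef(String name, FieldType type, int length) {
            this.name = name; this.type = type; this.length = length;
        }
    }

    /** One parse event; name and value are null for START_RECORD and END_RECORD. */
    static final class Event {
        final EventType type; final String name; final Object value;
        Event(EventType type, String name, Object value) {
            this.type = type; this.name = name; this.value = value;
        }
    }

    static final class EventReader {
        private final List<FieldDef> schema;
        private final PushbackInputStream raw;
        private final DataInputStream in;
        private int cursor = -1; // -1: between records; 0..n-1: next field; n: END_RECORD due

        EventReader(List<FieldDef> schema, InputStream source) {
            if (schema.isEmpty()) throw new IllegalArgumentException("schema must have fields");
            this.schema = schema;
            this.raw = new PushbackInputStream(source);
            this.in = new DataInputStream(raw);
        }

        boolean hasNext() {
            if (cursor >= 0) return true;      // mid-record: more events are owed
            try {
                int b = raw.read();            // peek one byte to detect end of stream
                if (b < 0) return false;
                raw.unread(b);
                return true;
            } catch (IOException e) { throw new UncheckedIOException(e); }
        }

        Event next() {
            if (!hasNext()) throw new NoSuchElementException();
            if (cursor == -1) { cursor = 0; return new Event(EventType.START_RECORD, null, null); }
            if (cursor == schema.size()) { cursor = -1; return new Event(EventType.END_RECORD, null, null); }
            FieldDef f = schema.get(cursor++);
            try {
                Object value;
                if (f.type == FieldType.INT32) {
                    value = in.readInt();      // big-endian 32-bit integer, boxed to Integer
                } else {
                    byte[] buf = new byte[f.length];
                    in.readFully(buf);
                    value = new String(buf, StandardCharsets.US_ASCII);
                }
                return new Event(EventType.FIELD, f.name, value);
            } catch (IOException e) { throw new UncheckedIOException(e); }
        }
    }

    /** Example consumer: turns the event stream into XML text (no escaping, for brevity). */
    static String toXml(List<FieldDef> schema, InputStream source) {
        StringBuilder out = new StringBuilder();
        EventReader reader = new EventReader(schema, source);
        while (reader.hasNext()) {
            Event e = reader.next();
            switch (e.type) {
                case START_RECORD: out.append("<record>"); break;
                case END_RECORD:   out.append("</record>"); break;
                default: out.append('<').append(e.name).append('>').append(e.value)
                            .append("</").append(e.name).append('>');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<FieldDef> schema = Arrays.asList(
            new FieldDef("id", FieldType.INT32, 4),
            new FieldDef("code", FieldType.ASCII, 3));
        byte[] data = {0, 0, 0, 7, 'a', 'b', 'c', 0, 0, 0, 42, 'x', 'y', 'z'};
        System.out.println(toXml(schema, new ByteArrayInputStream(data)));
    }
}
```

Only one event exists at a time, so the reader's memory use does not grow 
with the length of the stream. The toXml helper does accumulate its output 
in a StringBuilder, which is fine for a demo; a consumer handling large 
data would write each event to an output stream instead.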

James D. Myers
Associate Director, Cyberenvironments and Technologies, NCSA
1205 W. Clark St, MC-257
Urbana, IL 61801
217-244-1934
jimmyers at ncsa.uiuc.edu