[dfdl-wg] Plumbing document

Mon Sep 5 12:59:21 CDT 2005

The only reason I haven't released the document "officially" to the 
working group is that it is very incomplete and half baked. There are no 
IP concerns. 

I'm concerned that the approach doesn't even hold up. In particular there 
is a "forward induction" from earlier fields to later fields implied in 
the approach. I'm not sure this works except for "stream-capable" formats. 
Some formats depend on random access capabilities, or on definitions 
that work back from the end of the data. 
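
To make the concern concrete, here is a minimal sketch in Python of a format 
that cannot be read by forward induction alone. The layout (padding, payload, 
then an 8-byte trailer holding the payload's offset and length) and the names 
build and read_payload are hypothetical, invented only for illustration; they 
are not taken from any DFDL draft.

```python
import struct

# Hypothetical layout, assumed for illustration only:
#   [padding][payload][4-byte offset of payload][4-byte length of payload]
# Both trailer fields are big-endian unsigned integers.

def build(payload: bytes, padding: bytes = b"\x00" * 3) -> bytes:
    """Write the payload followed by a trailer that says where it lives."""
    offset = len(padding)
    trailer = struct.pack(">II", offset, len(payload))
    return padding + payload + trailer

def read_payload(data: bytes) -> bytes:
    """Locate the payload by reading the trailer at the END of the data first."""
    offset, length = struct.unpack(">II", data[-8:])
    return data[offset:offset + length]

blob = build(b"hello")
print(read_payload(blob))
```

A reader that only moves forward from byte 0 has no way to know where the 
payload starts until it has seen the last 8 bytes, which is the random-access 
case described above.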

Mike Beckerle
Architect, Scalable Computing
IBM Software Group
Information Integration Solutions
Westborough, MA

Martin Westhead <martinwesthead at yahoo.co.uk> 
Sent by: owner-dfdl-wg at ggf.org
09/05/2005 09:12 AM

To
"Robert E. McGrath" <mcgrath at ncsa.uiuc.edu>
cc
dfdl-wg at gridforum.org
Subject
Re: [dfdl-wg] Plumbing document

Hi Robert,

See inline:

Robert E. McGrath wrote:
> Greetings,
> 
> I took a quick glance at the streams and semantics notes sent yesterday.
> 
> There is obviously common ground here, and both meshed with my own
> half-baked thinking.
> 
> They did make me disagree with one part of the approach, which may
> make things a little simpler.
> 
> So here's my thinking.
> 
> IMO, there is no reason to worry about a detailed definition of streams.
> 
> It seems to me that we are simply dealing with sequences of bits, which
> can be streams or not.
> 
> So I see the universe of DFDL as:
> 
>    sequence of bits ==> computer science data type ==> seq. of bits
> 
> where CS data type is "byte", "int32", etc. (Z, G, et al. in the
> semantics note)

There are two issues/points I have with your statement above:

  1. We have chosen (at this point) the XML/XML Schema data model as our 
data model. Another way of thinking about what we are doing here is as 
follows: XML Schema provides a way of describing the syntax and type-level 
semantics of XML documents. DFDL extends that capability so that 
XML Schema can describe other (I want to say "all") text and binary formats.

  2. DFDL is describing:

  bits ==> XML type ==> ... ==> XML type ==> ... ==> XML type ==> bits

i.e. there are arbitrarily many layers of description, and we would like 
them to be separable (modular), e.g.

bits ==> strings ==> ints ==> (back again).
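
The layered round trip above can be sketched in Python as a pair of 
composable parse/unparse steps. This is only an illustration under stated 
assumptions: the Layer class, the compose helper and the two example layers 
(ASCII text, comma-separated integers) are hypothetical names, not part of 
DFDL or of any document discussed in this thread.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Layer:
    """One level of description: how to go up (parse) and back down (unparse)."""
    parse: Callable[[Any], Any]
    unparse: Callable[[Any], Any]

def compose(lower: Layer, upper: Layer) -> Layer:
    """Stack two layers; the upper layer never sees the lower layer's input."""
    return Layer(
        parse=lambda x: upper.parse(lower.parse(x)),
        unparse=lambda y: lower.unparse(upper.unparse(y)),
    )

# bits ==> strings (assumes ASCII encoding)
bytes_to_text = Layer(
    parse=lambda b: b.decode("ascii"),
    unparse=lambda s: s.encode("ascii"),
)

# strings ==> ints (assumes comma-separated decimal integers)
text_to_ints = Layer(
    parse=lambda s: [int(tok) for tok in s.split(",")],
    unparse=lambda xs: ",".join(str(x) for x in xs),
)

csv_ints = compose(bytes_to_text, text_to_ints)
values = csv_ints.parse(b"1,22,333")
print(values, csv_ints.unparse(values))
```

Each layer is described only in terms of the layer directly beneath it, which 
is the separability property asked for above.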

> The DFDL talks about the CS data types, with decorations to tell how to
> do the transformations to bits. I think that's all DFDL does (which is
> plenty!)
> 
> 
> Now the second place that "streams" enters the picture is to deal with
> XML's notion of the order of elements, which DFDL is trying to use
> to deal with the order of the bits. 
> 
> (I think this confuses me because it is overloading XML's notion of an 
> XML file with the organization of the described files.  You can make 
> it work, but it's not really clean, at least to me.)

I agree that this is a concern.

> To me, it is more natural to define a notion of a "sequence of CS data
> types", i.e., the elements. The decorations indicate where the bits for
> each element are (i.e., each element has its own sequence of bits, not
> necessarily from a continuous stream).
> 
> This is more general than a stream (it can accommodate random access),
> and probably can be stated as a simple mapping.
> 
> 
> So the summary is: 
> 
> I think it would simplify the abstractions to not talk about streams.
> 
> Instead, we should talk about sequences of bits, one for each element,
> and a model associating elements with bits.
> 
> 
> I hope this isn't too far off beam.

I think the big issue is the layering. It complicates the question of 
position: do you mean index by byte, by character or by comma-separated 
value?

We want the representation of layers to be modular so that you can 
replace the string representation with a binary representation and the 
application (which is dealing with a list of numbers) does not need to 
know. It is important that the descriptions are contained and that the 
description of the integer list does not reference the underlying byte 
positions.
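
As a minimal sketch of that modularity requirement, the Python below gives 
the same list of integers two different underlying representations and an 
"application" that works with either. The function names and the choice of 
4-byte big-endian signed integers for the binary form are assumptions made 
purely for illustration.

```python
import struct
from typing import Callable, List

def parse_text(data: bytes) -> List[int]:
    """String representation: ASCII decimal numbers separated by commas."""
    return [int(tok) for tok in data.decode("ascii").split(",")]

def parse_binary(data: bytes) -> List[int]:
    """Binary representation: packed 4-byte big-endian signed integers."""
    count = len(data) // 4
    return list(struct.unpack(f">{count}i", data))

def second_item(parse: Callable[[bytes], List[int]], data: bytes) -> int:
    """The application indexes the integer list, never the underlying bytes."""
    return parse(data)[1]

text_data = b"10,200,3000"
binary_data = struct.pack(">3i", 10, 200, 3000)

# Item 1 starts at byte 3 in the text form and at byte 4 in the binary form,
# but the application asks for position 1 in both cases.
print(second_item(parse_text, text_data))
print(second_item(parse_binary, binary_data))
```

Because the application refers to position only at the integer-list level, 
swapping parse_text for parse_binary requires no change to second_item.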

Layering is IMO the reason that we need some formal description; it is 
also the reason that it is hard. I would like to try to take this 
forward a little. I think in the wake of this new spec it is timely.

Mike, can you give me a steer on the IP status of your document? I 
understand that it has not been submitted to the WG. Do you propose to 
submit it? I think the basic outline is consistent with things you have 
said at WG meetings (though not the level of detail). If we were to 
produce a document that contained some of these ideas, would that be a 
problem?

Thanks,

Martin
