[dfdl-wg] How to handle multi-dimensional arrays - version 2

Thu Mar 3 16:20:52 CST 2005

I sort-of agree :-) I think the distinctions I'm making are subtle but
important with respect to composability/layers, and they don't shift
what you can do in the multidimensional array case away from where
you're trying to go. And if that doesn't make your eyes cross and cause
fits, read on...

Why haven't you felt it necessary to define an XSD for vectors
beyond putting dfdl:runtimeoccurs limits on how many to pull from a
stream? In this case, the runtimeoccurs param is a param of the reader
that populates a 'normal' XSD sequence with 'normal' XSD elements. For
multidimensional arrays, the runtimeoccurs parameters for each dimension
are now becoming part of the model rather than parameters of the reader.
I don't know if I like that, but, if we do it, why not do it everywhere
and make, for example, dfdl:byteorder an attribute on all ints and
floats that are read? Of course, a byteorder attribute would only be
available if you actually came from a binary stream, which may be
defined elsewhere (some enclosing node, another layer). To me, any of
this starts to mix the model and the method used to read the model,
which gets back to the issue of how independent readers are: does
creating a new reader imply creating a new subtype, etc.?
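
To make the reader-parameter view concrete, here is a minimal Python
sketch (not DFDL; the function name read_ints and its arguments are made
up for illustration). The count and the byte order are parameters of the
reader, and what comes out is a plain list of ints that carries neither:

```python
import io
import struct

def read_ints(stream, count, byteorder="big"):
    """Read `count` 32-bit signed ints from a binary stream.

    `count` plays the role of a runtimeoccurs-style limit and `byteorder`
    the role of a dfdl:byteorder-style setting. Both are hypothetical
    reader parameters here, not part of the resulting model.
    """
    prefix = ">" if byteorder == "big" else "<"
    raw = stream.read(4 * count)
    if len(raw) != 4 * count:
        raise ValueError("stream ended before count ints were read")
    return list(struct.unpack(f"{prefix}{count}i", raw))

# Two differently encoded streams populate the same 'normal' sequence.
big = io.BytesIO(struct.pack(">3i", 1, 2, 3))
little = io.BytesIO(struct.pack("<3i", 1, 2, 3))
same_model = read_ints(big, 3, "big") == read_ints(little, 3, "little")
```

If byteorder were instead attached to every int in the result, the two
lists above would no longer compare equal even though they hold the same
values, which is the mixing of model and method I'm worried about.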

I guess I'd rather see the concept of multidimensional arrays as
follows: there is not, in fact, a multidimensional array on disk/stream,
just a serialized sequence of ints/floats/whatever. But, to assist the
user in interpreting this flat data as a multidimensional array, we want
DFDL to make index info available. Rather than just making a single
cursor count available and requiring users to do math to get indexes
that don't start at 0 (or 1 - whatever) or to support multiple
dimensions, we provide some convenience mechanisms that report an index
or indexes that cycle between user-defined mins and maxes as elements
are read. These can be used to decorate elements with attributes, be
used in conditional logic, etc. This would preserve the separation of
reader from model, at the expense of saying that indexes like this are
different from all the other dfdl reader parameters that might be in
the current context.
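
A rough sketch of the convenience mechanism I have in mind, in Python
rather than DFDL. The names cycling_indexes and decorate are invented
for this example, and the row-major ordering (last index varies fastest)
is an assumption, since nothing above fixes a storage order:

```python
from itertools import cycle, product

def cycling_indexes(bounds):
    """Yield index tuples that cycle through inclusive (min, max) bounds.

    The last dimension varies fastest (row-major order, an assumption of
    this sketch). After the final tuple the indexes wrap around to the
    first one again.
    """
    ranges = [range(lo, hi + 1) for lo, hi in bounds]
    return cycle(product(*ranges))

def decorate(flat_values, bounds):
    """Pair each element of a flat sequence with its reported indexes."""
    return list(zip(cycling_indexes(bounds), flat_values))

# Six values read off a stream, viewed as rows 1..2 by columns 0..2.
flat = [10, 11, 12, 20, 21, 22]
for idx, value in decorate(flat, [(1, 2), (0, 2)]):
    print(idx, value)
```

The data stays a flat sequence throughout; the indexes are only something
the reader reports alongside each element.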

> What makes all this confusing for DFDL is that we have some
> representations that are complex enough to need layered multi-step
> descriptions, and once you have that, there's no stopping you from
> using it to do all sorts of transformation from one format to another.
> So it feels like you can have your cake and eat it too, which is to say
> you can pick your XML Schema and populate it from quite differently
> structured data. And that is probably true, but at the bottom level of
> the stack of layers you have to have a vocabulary and model for
> directly describing the structure of the data so as to get the whole
> ball rolling. And at this bottom layer, the needs of describing the
> data format completely dictate what the schema is like.

I would solve this by just saying that, at the bottom layer, there are
no single or multidimensional arrays, just sequences of base types, and
that any concept of dimensions is a fabrication created by the user (a
very common and convenient one we might want special support for...).

The only reason I think we would need a multidimensional array type in
DFDL is if we wanted to directly read m*n bytes and create a single XML
element representing the entire array that would then have some accessor
methods to get a value for a particular x,y offset pair. I'm not sure
what kind of analogy will make sense to people here, but I see a similar
argument for floats from strings: if you want to create a float from a
sequence of characters, you need a float type. If you just want to
prescribe a standard way to model a sequence of characters representing
a float so that we can consistently label the mantissa and exponent
chars regardless of storage order, you're not really defining a new
float type in XSD. Instead, you're exposing the semantics created/inferred
by the reader as standardized annotations of the existing char (or
string) type (with the annotations being potential or required depending
on whether you let me shut them off or not).
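
For contrast, the one case above that I think would call for a real
array type looks something like the following Python sketch. The class
name ByteArray2D, its get method, and the row-major layout are all
assumptions made up for illustration, not anything proposed for DFDL:

```python
class ByteArray2D:
    """Hypothetical single object wrapping m*n raw bytes read in one go."""

    def __init__(self, raw, rows, cols):
        if len(raw) != rows * cols:
            raise ValueError("expected rows * cols bytes")
        self.raw = raw
        self.rows = rows
        self.cols = cols

    def get(self, x, y):
        """Return the byte at column x, row y (row-major layout assumed)."""
        if not (0 <= x < self.cols and 0 <= y < self.rows):
            raise IndexError("offset out of range")
        return self.raw[y * self.cols + x]

# One element stands for the whole 2 x 3 array; values come from accessors.
grid = ByteArray2D(bytes(range(6)), rows=2, cols=3)
value = grid.get(x=2, y=1)
```

Here the dimensions really are part of the type, because the individual
values no longer exist as separate elements in the model.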

  Jim