[dfdl-wg] How to handle multi-dimensional arrays

Wed Mar 2 17:58:02 CST 2005

I think I get your points, Jim. My stream-of-consciousness responses are
interleaved below. Later points build on earlier ones, so please read it all
before typing up any reactions to the earlier ones.

...mike

> 
> This approach seems to dictate how the user will represent 
> the array in XML (dictating their schema) rather than just 
> describing how to pick up the right content. Which I think we 
> agreed is a bad thing. 

Your concern seems to be with the requirement that the element have
attributes which make the position of the element explicit. I suppose this
is only needed if you want to access the array with some sort of
multi-dimensional XPath-like expression. We chose this attribute mechanism
because it seemed the most compact thing we could come up with that was
consistent with XPath-like expressions that could select a specific element
of the array. 

We could make it flexible so that you could use sub-elements to hold the
positions, not just attributes, or you could even leave them out. If you
leave them out then I don't see a way to both tolerate storage-order
variations and provide index-based access within the XSD/XML framework. 

However, if you are not translating into XML, but using an API to access the
data where that API is supplied by a DFDL-aware library and has direct
support for multi-dimensional arrays, then you certainly do not want lots of
storage for a numeric array being taken up representing redundant attributes
that just hold the coordinate positions. Hence, these dimension attributes
really should be hidden, and not parts (elements or attributes) of the
logical data model. 

> That's not to say this isn't a 
> reasonable way to represent a multidim array in XML, just 
> that having DFDL go look at the attributes outside the 
> annotation and requiring this formatting in the output XML 
> doesn't seem right. 
> 

Point taken. 

However, can you suggest alternatives for the concrete output XML for a
multidim array? I'm kind of stuck in the rut of this one currently.
Everything else I look at feels awkward in comparison, but this may just be
a failure of imagination.

> In
>       <dfdl:dataFormat arrayStorageOrder="@y @x">
> 
> it seems that DFDL only needs the dimension sizes for its own 
> purposes and we could probably use our referencing mechanism 
> to get them (i.e. if I read the two dimension ints earlier 
> and want to reference them for the array sizes) - maybe some 
> kind of <dfdl:runtimeoccurs> elements for the n dimensions?

(BTW: since the earlier mailing, in our prototype this rep property has been
renamed to arrayDimensions, since it serves to identify the attributes which
provide the dimensions of the element. The order of the elements in the list
continues to be used to specify the storage order.)
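
To make the storage-order point concrete, here is a minimal Python sketch. It
is not the prototype's code; the function and parameter names are invented
for illustration. It assumes the first name in the list is the
slowest-varying dimension and that positions are 1-based as in XPath. Neither
convention is settled in this thread.

```python
from itertools import product

def assign_positions(flat_values, dim_order, sizes):
    """Pair each stored value with its coordinates.

    dim_order: dimension names in storage order, e.g. ["y", "x"].
               ASSUMPTION: the first name varies slowest.
    sizes:     dict mapping dimension name -> extent.
    """
    expected = 1
    for name in dim_order:
        expected *= sizes[name]
    if len(flat_values) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat_values)}")

    # product() varies its last argument fastest, which matches the
    # assumption that the last-listed dimension is the fastest-varying one.
    ranges = [range(1, sizes[name] + 1) for name in dim_order]
    return [
        (dict(zip(dim_order, coords)), value)
        for coords, value in zip(product(*ranges), flat_values)
    ]

# Same six stored values, two different storage orders.
stored = [10, 20, 30, 40, 50, 60]
sizes = {"x": 3, "y": 2}
y_then_x = assign_positions(stored, ["y", "x"], sizes)
x_then_y = assign_positions(stored, ["x", "y"], sizes)
```

Swapping the order of the names changes which coordinates each stored value
receives, without touching the stored data itself.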

>   
> To allow the kind of output XML you propose, we probably need 
> something new to allow you to loop. 

I don't see why we need a loop. Well not yet anyway. 

I can clearly see needing to write expressions that need to know the
coordinates of the current element's position in the array. I was thinking
of that as using XPath-like expressions that would refer to the "dimension
attributes". I.e., if I have a 2d array with axes x and y, then ./@x is this
element's x-axis position, and ./@y is this element's y-axis position. In a
single-dimensional case with an unnamed axis, position() is the XPath way of
doing this. 
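
As a rough illustration of index-based access through dimension attributes,
here is a sketch using Python's standard xml.etree.ElementTree, which
supports only a small subset of XPath. The element names "grid" and "cell"
and the helper cell_at are made up for the example; they are not part of any
DFDL proposal.

```python
import xml.etree.ElementTree as ET

# A 2 x 2 array whose elements carry explicit dimension attributes.
doc = ET.fromstring(
    "<grid>"
    '<cell x="1" y="1">1.0</cell>'
    '<cell x="2" y="1">2.0</cell>'
    '<cell x="1" y="2">3.0</cell>'
    '<cell x="2" y="2">4.0</cell>'
    "</grid>"
)

def cell_at(root, x, y):
    """Select one element by its coordinates (hypothetical helper)."""
    # Chained attribute predicates pick the element whose @x and @y match.
    return root.find(f"cell[@x='{x}'][@y='{y}']")

by_coords = cell_at(doc, 2, 1)

# Without attributes, only the 1-based document position is available,
# which ties the lookup to the storage order.
by_position = doc.find("cell[2]")
```

The attribute lookup keeps working if the cells are emitted in a different
order; the positional lookup does not.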

So far I think the DFDL system can do the iterations, you just have to write
the "loop body" which is called over and over again. E.g., this would let
you do stuff like combine two conforming arrays of scalars to create an
array of tuples. Another important example is that it lets you do stuff like
have the null-indicators for an array of nullable values be stored in a
separate array of individual flag bits. 
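
The two examples above can be sketched in Python as follows. This only
illustrates the "implied loop" idea: the system owns the iteration and the
user supplies the body. The name map_positions, the sample data, and the bit
layout of the null flags (bit i-1, least significant bit first, set means
null) are all assumptions made for the sketch.

```python
def map_positions(length, body):
    """System side: visit every 1-based position and call the body."""
    return [body(i) for i in range(1, length + 1)]

# Example 1: combine two conforming arrays of scalars into tuples.
xs = [1.5, 2.5, 3.5]
ys = [10.0, 20.0, 30.0]
points = map_positions(len(xs), lambda i: (xs[i - 1], ys[i - 1]))

# Example 2: null indicators stored separately as individual flag bits.
values = [7, 0, 9]
null_bits = 0b010  # ASSUMPTION: bit (i - 1) set means position i is null

def nullable_value(i):
    """Loop body: consult the flag bit for the current position."""
    is_null = (null_bits >> (i - 1)) & 1
    return None if is_null else values[i - 1]

decoded = map_positions(len(values), nullable_value)
```

In both cases the body only needs the current position; it never spells out
the loop.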

For now I'm trying to see how far we can get without putting in any sort of
looping or recursion constructs. With layering, the ability to write
expressions that refer to the "current element's position" (or its
multidimensional analogue), and the ability to dimension things using the
sizes of other things as parameters, I think we can do very powerful things
quite easily but still declaratively. I'm not philosophically opposed to
having more power, I just don't want to put it in until the need is clear. I
haven't seen any examples yet where an explicit loop, not an implied loop,
is truly needed. 

> If the only place we need 
> looping is for multidimensional arrays (and the special case 
> of one dim), perhaps we can do something similar to what you 
> propose and essentially have the array mechanism define some 
> loop variables that can be referenced (a dfdl layer?). 

I suppose one could use a layer to hide the dimension attributes from the
logical model. If the layer was named "rep", then the arrayDimensions
property would be "./rep/@y ./rep/@x", and they wouldn't appear in the
actual XML logical model of the data. You just wouldn't then have any
XPath-like way of indexing the resulting logical structure without digging
into the rep. It would be up to you whether you want them in the logical
structure or not. 

A layer is perhaps not abstract enough. You don't really want these things
to be realized at all, not in a layer nor in the logical model. 

I'll give this some more thought and follow up with a proposal along the
lines of your #currentcursorvalue# thing below. The idea is that you get an
expression that accesses the current position on any dimension of a
multi-dimensional array, and it's up to you whether you populate an
attribute, a subelement, or anything else with it. 

> don't have a full proposal thought out, but imagine defining 
> an array as a stream that can be referenced using multiple 
> dimensions rather than a single cursor, and having a 
> mechanism so that the current value of the cursor(s) are 
> available to the user.
> 
> So, from the Reference.xsd example, we might want to have an 
> attribute that shows the x value of the xdata elements 
> analogous to the multidim
> example:
> 
>                 <xs:element name="xdata" type="xs:float" maxOccurs="unbounded">
>                     <xs:annotation>
>                         <xs:appinfo>
>                             <dfdl:runtimeoccurs>../x</dfdl:runtimeoccurs>
>                         </xs:appinfo>
>                     </xs:annotation>
>                     <xs:attribute name="x">
>                        <dfdl:runtimevalue = #currentcursorvalue#/>
>                     </xs:attribute>
>                 </xs:element>
> 
> where #currentcursorvalue# is something we have not yet made 
> available for output (or is this available via xpath - the 
> position of the current element in a sequence?). 

Yes, in XPath, position() fills this role for a 1d array. 

> This would 
> change the Reference.xml example output to have elements like
> 
>   <xdata x="1">2.78</xdata>
>   <xdata x="2">3.14</xdata>
> 
>  

Right. So the user chooses whether to tuck the position into an attribute or
not. 
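
A small Python sketch of that choice, using the standard
xml.etree.ElementTree module. The function render_xdata, its position_attr
parameter, and the wrapping "data" element are hypothetical; the point is
only that the position is available to the writer, who may or may not emit
it.

```python
import xml.etree.ElementTree as ET

def render_xdata(values, position_attr=None):
    """Emit one <xdata> element per value.

    position_attr: attribute name to hold the 1-based position,
                   or None to leave the position out of the XML.
    """
    root = ET.Element("data")  # hypothetical wrapper element
    for position, value in enumerate(values, start=1):
        element = ET.SubElement(root, "xdata")
        if position_attr is not None:
            element.set(position_attr, str(position))
        element.text = str(value)
    return ET.tostring(root, encoding="unicode")

with_positions = render_xdata([2.78, 3.14], position_attr="x")
without_positions = render_xdata([2.78, 3.14])
```

The same data and the same iteration produce either form; only the output
choice differs.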

> So, if I can summarize/rephrase, I think we should keep the 
> mechanism for single or multidim arrays separate from how the 
> output is displayed, but I like the idea of making the 
> current cursor(s) available for use, which I don't think 
> we've done yet. And having a real multidimension construct 
> rather than calculating them from a flat cursor is probably a 
> requirement for scientific use, so some multidim analog of 
> dfdl:runtimeoccurs is needed.
> 

Ah, runtimeoccurs. My example was fixed length so this didn't come up. I
have a dfdl:lengthCalc expression, and also a dfdl:storedLength path
property, in the prototype today. These need to generalize for
multidimensional stuff. I'll put that in the next version.

...mikeb