[dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS records as XML

Mon Nov 22 08:34:35 CST 2004

> The way I view physical rep information is as functions that 
> can be applied to types and fields. Writing the data out to a 
> blocked/segmented format does not fall into this category. It 
> is an orthogonal operation that applies to the whole data and 
> as such is much more akin to encryption and compression. For 
> example, I have a COBOL structure that ends up in an MQSeries 
> queue and in a QSAM file. It has a logical structure, it has 
> a physical representation. In the QSAM case a further 
> transform has taken place to block/segment the structure. I 
> would not expect to see the physical rep properties of the 
> types and elements change.

I think we've been talking about DFDL as always going TO the XML schema
and have considered the process of going FROM the XML to a new
serialization as 'inverse DFDL'. Towards that end, we've discussed being
able to mark transforms as invertible and/or allowing an inverse method
to be registered as part of the transform definition. We also talked
about the potential requirement of having multiple output streams: if I
read x and y dimensions and then pixels, but my output XML model is just
the pixel sequence, I will need to record x and y somewhere to allow
inversion, so the user (or DFDL) might want to specify x and y in some
separate 'provenance' file that could be used during inversion.
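
To make the x/y example concrete, here is a minimal sketch of the idea. Everything about the layout is an assumption made up for illustration (two big-endian 16-bit dimensions followed by one byte per pixel), and the names parse_image, unparse_image and the provenance dict are hypothetical, not part of any DFDL proposal. The forward step keeps only the pixel sequence in the logical model and sets x and y aside; the inverse step cannot rebuild the original bytes without them.

```python
import struct

def parse_image(data: bytes):
    """Forward direction: read x and y, then x*y one-byte pixels.

    Assumed layout (illustrative only): two big-endian unsigned 16-bit
    dimensions, followed by the pixel bytes.
    """
    x, y = struct.unpack_from(">HH", data, 0)
    pixels = list(data[4:4 + x * y])
    if len(pixels) != x * y:
        raise ValueError("truncated pixel data")
    # The logical model is just the pixel sequence; x and y go to a
    # separate 'provenance' record so the process can be inverted.
    provenance = {"x": x, "y": y}
    return pixels, provenance

def unparse_image(pixels, provenance):
    """Inverse direction: rebuild the byte stream from model + provenance."""
    x, y = provenance["x"], provenance["y"]
    if len(pixels) != x * y:
        raise ValueError("pixel count does not match recorded dimensions")
    return struct.pack(">HH", x, y) + bytes(pixels)
```

With a 3 x 2 image, parse_image returns six pixel values plus {"x": 3, "y": 2}, and unparse_image applied to both gives back the original bytes. Given the pixel list alone, 3 x 2 and 2 x 3 (or 6 x 1) are indistinguishable, which is the reason for the second output stream.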

I'm not sure that this is the best model, but I don't think we've come
up with a good way to describe going from the XML model except as the
inverse of the 'to' process.
> 
> Mike's idea of a schema level 'stream' rep property sounds ok 
> in principle for parsing, but what other metadata is needed 
> when serialising? How are we informed of the rules for VB 
> blocking or for IMS segmentation? Are they fixed or 
> user-defined? If these rules end up requiring extra metadata 
> at the type/element level then I am not comfortable with 
> this, because we are mixing two sets of physical information.
> 
> I think that whatever principles we apply to DFDL 
> including/excluding encryption and compression we should also 
> apply to these formats.  What is the current proposal in this 
> area? The cheapest option would be to provide a flexible 
> user-defined transform capability.

We planned to have a user-defined transform capability that would appear
in the same way as DFDL-standard transforms. I think one can easily put
something like zip into the same format as Alan has used for the basic
int-from-ASCII and int-from-binary transforms, as a byte-sequence to
byte-sequence transform. I think I'd vote for just including zip since it
will be used in a number of formats, but one could imagine a user adding
a de-pig-latinizer as needed. (Pig latin and things like run-length
encoding are examples we've used to point out that not all
compression/encryption-type algorithms will run on the raw input stream:
both of these require some level of parsing before you can use them,
to find words or to get the <value, # of repeats> pairs from the
initial bytes.)
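
As a rough sketch of what I mean, under assumptions of my own: the registry, the Transform record and the register function below are hypothetical names, not anything we have agreed on, and Python's zlib stands in for zip. A standard transform and a user-defined one are registered the same way, each optionally with an inverse. The run-length decoder assumes one value byte followed by one count byte per pair, which shows the parsing step: it has to split the input into pairs before it can expand anything.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Transform:
    forward: Callable[[bytes], bytes]
    inverse: Optional[Callable[[bytes], bytes]] = None  # None: not invertible

# Hypothetical registry: standard and user-defined transforms look the same.
REGISTRY: Dict[str, Transform] = {}

def register(name, forward, inverse=None):
    REGISTRY[name] = Transform(forward, inverse)

def rle_decode(data: bytes) -> bytes:
    """Expand (value, count) byte pairs; assumed one byte each."""
    if len(data) % 2:
        raise ValueError("input is not a whole number of pairs")
    out = bytearray()
    for i in range(0, len(data), 2):
        value, count = data[i], data[i + 1]  # the 'parsing' step
        out.extend(bytes([value]) * count)
    return bytes(out)

def rle_encode(data: bytes) -> bytes:
    """Inverse of rle_decode; runs are capped at 255 to fit one count byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([data[i], run])
        i += run
    return bytes(out)

# zlib stands in for 'zip' here: a plain bytes-to-bytes transform.
register("unzip", zlib.decompress, inverse=zlib.compress)
# A user-defined transform, registered in exactly the same way.
register("rle", rle_decode, inverse=rle_encode)
```

One caveat on the inverse: decompressing and then recompressing is only guaranteed to round-trip the logical bytes, not to reproduce the original compressed bytes exactly, so 'invertible' needs to say at which level.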