[dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

Fri Nov 19 08:51:36 CST 2004

Without digging too much into the details, I'd say this is an example
where multi-layer comes in. The DFDL would describe a hidden layer in
which the first, middle, last data elements would be identified and put
into a list, and then that hidden list would be used as the input to
create items in the output layer.
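
To make the two-layer idea concrete, here is a minimal sketch in Python rather than DFDL. It is only an illustration under stated assumptions, not a proposal for DFDL syntax. The sample is abridged from Mike's message quoted below; the `<ROOT>` wrapper and the function names `hidden_layer` and `output_layer` are hypothetical, introduced only so the sample parses as one document and the two layers have names.

```python
import xml.etree.ElementTree as ET

# Sample input abridged from the quoted message. The <ROOT> wrapper is an
# assumption added here so the BLOCKs form a single well-formed document.
SAMPLE = """
<ROOT>
  <BLOCK><SEGMENT><WHOLE/><DATA>The first item</DATA></SEGMENT></BLOCK>
  <BLOCK><SEGMENT><FIRST/><DATA>Thi</DATA></SEGMENT></BLOCK>
  <BLOCK><SEGMENT><MIDDLE/><DATA>s is t</DATA></SEGMENT></BLOCK>
  <BLOCK><SEGMENT><MIDDLE/><DATA>he sec</DATA></SEGMENT></BLOCK>
  <BLOCK>
    <SEGMENT><LAST/><DATA>ond item</DATA></SEGMENT>
    <SEGMENT><WHOLE/><DATA>Third item</DATA></SEGMENT>
  </BLOCK>
</ROOT>
"""


def hidden_layer(root):
    """Layer 1: flatten the BLOCK/SEGMENT nesting into (tag, data) pairs."""
    pairs = []
    for segment in root.iter("SEGMENT"):
        tag = segment[0].tag  # first child is the type tag: WHOLE, FIRST, MIDDLE or LAST
        data = segment.findtext("DATA", default="")
        pairs.append((tag, data))
    return pairs


def output_layer(pairs):
    """Layer 2: turn WHOLE segments and FIRST, MIDDLE*, LAST runs into items."""
    items = []
    parts = None  # pieces of the item currently being assembled, if any
    for tag, data in pairs:
        if tag in ("WHOLE", "FIRST"):
            if parts is not None:
                raise ValueError(f"{tag} segment inside an unfinished item")
            if tag == "WHOLE":
                items.append(data)
            else:
                parts = [data]
        elif tag in ("MIDDLE", "LAST"):
            if parts is None:
                raise ValueError(f"{tag} segment without a preceding FIRST")
            parts.append(data)
            if tag == "LAST":
                items.append("".join(parts))
                parts = None
        else:
            raise ValueError(f"unknown segment type {tag}")
    if parts is not None:
        raise ValueError("input ended inside an unfinished item")
    return items


if __name__ == "__main__":
    for item in output_layer(hidden_layer(ET.fromstring(SAMPLE))):
        print(f"<ITEM>{item}</ITEM>")
```

The point of the split is that the second function never looks at BLOCK boundaries: once the hidden list exists, assembling items is a single forward pass.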

I think this is conceptually similar to one of our run-length encoding
examples (more complex of course). If you read a sequence of ints and
then a sequence of floats and need to output a sequence of floats with
int[i] repeats of float[i], it would be easiest to create a hidden layer
representing the int and float sequences and to then produce output from
that. If you don't think about a layer, even this example gets painful -
I need to read an int, skip forward somewhere to find a float, skip back
to get the next int, etc.
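
A minimal Python sketch of that run-length case, assuming the input is a flat list of n ints followed by n floats with n known in advance (the names `split_layers` and `expand` are hypothetical, and a real format would have to say where n comes from):

```python
def split_layers(values, n):
    """Hidden layer: separate the flat input into the int and float sequences."""
    counts = [int(v) for v in values[:n]]
    floats = [float(v) for v in values[n:2 * n]]
    return counts, floats


def expand(counts, floats):
    """Output layer: emit floats[i] repeated counts[i] times."""
    out = []
    for count, value in zip(counts, floats):
        out.extend([value] * count)
    return out


if __name__ == "__main__":
    counts, floats = split_layers([2, 1, 3, 0.5, 1.5, 2.5], 3)
    print(expand(counts, floats))
```

With the hidden layer in place, the output step only pairs up int[i] with float[i]; there is no skipping back and forth in the input.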

Mike's full example, not starting with the XML-ized version, might be
something that requires more than one layer - read the original into
something with the XML schema Mike defines, then a layer making a
sequence of data elements, and then something that has the desired
logical output.

I guess I would claim that this would not be too bad a way to describe a
fairly complex format in terms of a fairly different logical structure.
Whether one *should* do this in DFDL is a separate question: it might
make more sense to a) write a black box parser to get to items, or b)
use DFDL to get to the initial schema Mike wrote and use XSLT afterwards
to convert to the desired logical structure. I think there are enough
relatively simple cases where we need the multilayer functionality in
DFDL that we have to have it, which means it will then be possible to
deal with complex transformations in DFDL even if doing so is not simple
or practical.

  Jim

 -----Original Message-----
From: owner-dfdl-wg at ggf.org [mailto:owner-dfdl-wg at ggf.org] On Behalf Of
mike.beckerle at ascentialsoftware.com
Sent: Thursday, November 18, 2004 9:53 PM
To: dfdl-wg at gridforum.org
Subject: [dfdl-wg] simple way to study hard DFDL example problem - IBM
Format VS records as XML

	I've come up with a way to articulate the difficulties I'm
having with DFDL for complex file formats.

	This problem may not be that hard for someone with more XML,
XPath or XQuery experience, so I'd appreciate it if you could look it
over and if necessary even run it by your resident XML experts.

	In case the emailer mangles all the line lengths, I've also
attached the below as a file.

	<!-- Example motivated by DFDL for IBM Format-VS -->
	<!-- see http://tinyurl.com/3s2bq for details on IBM Format-VS
-->

	<!-- Logically, our data is this: -->

	<ITEM>The first item</ITEM>
	<ITEM>This is the second item</ITEM>
	<ITEM>Third item</ITEM>

	<!-- That is, data having this "logical" schema -->

	<sequence>
	  <element name="ITEM" type="string" minOccurs="0"
maxOccurs="unbounded"/>
	</sequence>

	<!-- But the below is the input data we're starting from. What you
	     see below simulates the structural issues of IBM Format-VS, while
	     converting the problem into an XML-to-XML transformation problem -->

	<BLOCK>
	  <SEGMENT>
	    <WHOLE/> <!-- a WHOLE segment holds a whole item (Duh!).
This element is really a type tag. -->
	    <DATA>The first item</DATA>  
	  </SEGMENT>
	</BLOCK>

	<BLOCK>
	  <SEGMENT>
	    <FIRST/> <!-- a FIRST segment holds the first part of an
item. -->
	    <DATA>Thi</DATA>
	  </SEGMENT>
	</BLOCK>

	<BLOCK>
	  <SEGMENT>
	    <MIDDLE/> <!-- a MIDDLE segment holds data from the center
of an item -->
	    <DATA>s is t</DATA>
	  </SEGMENT>
	</BLOCK>

	<BLOCK>
	  <SEGMENT>
	    <MIDDLE/> 
	    <DATA>he sec</DATA>
	  </SEGMENT>
	</BLOCK>

	<BLOCK>
	  <SEGMENT>
	    <LAST/> <!-- a LAST segment holds data from the end of the
item.  -->
	    <DATA>ond item</DATA>
	  </SEGMENT>
	  <SEGMENT>
	    <WHOLE/><!-- This second segment in this block is a WHOLE
segment. However 
	                 in general the 2nd segment of a block could be
a WHOLE or the 
	                 FIRST segment of another multi-segment
multi-block spanning item -->
	    <DATA>Third item</DATA>
	  </SEGMENT>
	</BLOCK>

	<!-- Some observations: -->
	<!-- Data is organized into BLOCKs -->
	<!-- Each block contains 1 or 2 SEGMENTs -->
	<!-- Each SEGMENT is either a WHOLE item, or the item spans 2 or
more SEGMENTs -->
	<!-- Spanning data is broken on arbitrary boundaries across
segments it spans -->
	<!-- Spanning involves a FIRST, MIDDLE*, LAST segment structure.
-->
	<!-- MIDDLE* means zero or more MIDDLE segments. -->

	<!-- The question: how can we express the transformation into the
	     desired logical form? Or is this beyond the call of duty for DFDL?
	     Goals include being as declarative as possible and, ideally,
	     doing it as a set of XML Schema annotations in the GGF DFDL style. -->

	<!-- here's an XSD (untested) for the input data structure -->

	<complexType name="Format_VS_t">
	 <sequence>
	   <element name="BLOCK" type="Block_t" minOccurs="0"
maxOccurs="unbounded"/>
	 </sequence>
	</complexType>

	<complexType name="Block_t">
	      <sequence>
	         <element name="SEGMENT" type="Segment_t" minOccurs="1"
maxOccurs="2"/>
	      </sequence>
	</complexType>

	<complexType name="Segment_t">
	 <sequence>
	  <choice>
	    <element name="WHOLE">
	    </element>
	    <element name="FIRST">
	    </element>
	    <element name="LAST">
	    </element>
	    <element name="MIDDLE">
	    </element>
	  </choice>
	  <element name="DATA" type="string"/>
	 </sequence>
	</complexType>
