[dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS records as XML

Fri Nov 19 11:34:18 CST 2004

	Unfortunately, there's a slippery slope here - there are no ints
on the disk, just logical ones and zeros that you can transform into a
second logical structure composed of ints, assuming you specify byte
order. I think we have a whole stream of examples beyond that - removing
delimiters, using a length prefix to define the length of a subsequent
structure, etc. - that we see as minor transformations to something
still relatively "compliant" with the physical structure, but, I
believe, require the same machinery as things I think we will all agree
are beyond the scope of what DFDL should aim for.
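
As an illustration of the "minor" transformations listed above, here is a minimal Python sketch (not DFDL) of reading an int under a stated byte order and of using a length prefix to bound the structure that follows. The 4-byte prefix width, the sample bytes, and the name `read_length_prefixed` are assumptions made for the example.

```python
import struct

# Seven raw bytes: a 4-byte length prefix followed by three payload bytes.
RAW = bytes([0x00, 0x00, 0x00, 0x03, 0x61, 0x62, 0x63])

# The same four bytes give different ints depending on the declared byte order.
big_endian = struct.unpack(">i", RAW[:4])[0]
little_endian = struct.unpack("<i", RAW[:4])[0]


def read_length_prefixed(buf, byte_order=">"):
    """Return the payload whose size is given by a leading 4-byte int (assumed width)."""
    (length,) = struct.unpack(byte_order + "i", buf[:4])
    payload = buf[4:4 + length]
    if length < 0 or len(payload) != length:
        raise ValueError("length prefix does not match the available data")
    return payload
```

With these bytes, `big_endian` is 3 and `little_endian` is 50331648, and `read_length_prefixed(RAW)` returns `b"abc"`. Nothing on disk says which reading is right; the byte order has to be stated.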
	 
	In practice, I think people should get out of DFDL as soon as
possible just as you say - use other technologies once you get an
initial structure. But I think there are cases where you have to stay in
DFDL - anything where I have to transform the initial
physically-compliant structure to interpret subsequent fields - x and y
ints tell me how many pixel repeats, an int greater than another int
read previously implies a different subsequent structure, etc. And
again, the minimal machinery to do that lets you go farther than you'd
want people to go in practice.
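
A rough Python sketch of the second case (an int compared with an earlier int decides what follows). The record layout, both variants, and the name `parse_record` are hypothetical; the point is only that the comparison has to happen during parsing, before the rest of the bytes can be interpreted.

```python
import struct


def parse_record(buf):
    """Hypothetical layout: two big-endian int32 values, then a body whose
    shape depends on how the second int compares with the first."""
    first, second = struct.unpack(">ii", buf[:8])
    body = buf[8:]
    if second > first:
        # Assumed variant A: a single big-endian float64 follows.
        (value,) = struct.unpack(">d", body[:8])
        return ("float", value)
    # Assumed variant B: 'second' bytes of ASCII text follow.
    return ("text", body[:second].decode("ascii"))
```

For example, `parse_record(struct.pack(">iid", 1, 2, 1.5))` takes the first branch, while `parse_record(struct.pack(">ii", 5, 3) + b"abcdef")` takes the second.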
	 
	There may also be reasonable use cases where the ability to stay
in DFDL is important. For example, take digital preservation, where I
might want to map all document files to a standardized schema,
regardless of whether it was word, pdf, etc. Being able to specify the
full descriptions in one file that then requires only one parser to
interpret all formats *might* be worth the cost to do complex things in
DFDL. I don't think our goal for a version 1 should be to support such
use, but I don't think we can meet our simple goals without
'accidentally' making it possible.
	 
	I'd be happy to be proved wrong - seems like a deep point that
would be cool to understand. I'm not sure how we get to a 'proof' though
- we're trying to prove that there exists something DFDL as currently
formulated can't describe. So - we either need to find that example or
turn to some sort of logic formalism to discover what primitive(s) we're
missing that keep us from emulating some class of parser/programming. (Or
find something in DFDL that we don't need to support the examples we do
want to target...).
	 
	  Jim
	 
	 
	 -----Original Message-----
	From: owner-dfdl-wg at ggf.org [mailto:owner-dfdl-wg at ggf.org] On
Behalf Of Suman Kalia
	Sent: Friday, November 19, 2004 11:50 AM
	To: dfdl-wg at gridforum.org
	Subject: Fw: [dfdl-wg] simple way to study hard DFDL example
problem - IBMFormat VS records as XML
	
	
	I tend to agree that there are 2 inherent logical structures in this
scenario.  DFDL scope in my opinion should be restricted to parsing the
physical stream and populating the logical structure which is compliant
with the structure of the physical stream and vice versa.  We have numerous
options and technologies (XSLT, XSD<->XSD mappers, good old programming
languages, XQuery) which do a pretty good job of transforming one logical
structure to another logical structure.  Building some kinds of
annotations which would allow a physical stream to map to a completely
different logical structure will make the DFDL language very complex. 
	
	Suman Kalia
	IBM Toronto Lab
	WebSphere Business Integration Application Connectivity Tools 
	Tel : 905-413-3923  T/L  969-3923
	Fax : 905-413-4850
	Internet ID : kalia at ca.ibm.com 
	----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 11:36
AM ----- 
	
"Myers, James D" <jim.myers at pnl.gov> 
Sent by: owner-dfdl-wg at ggf.org 

11/19/2004 11:05 AM 

To
dfdl-wg at gridforum.org 
cc
Subject
RE: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat
VS records as XML

	
	 I was thinking that step 1 involved recognizing the <first/>
and <data> elements and creating a sequence of <myfirst>here's the
data</myfirst>, <mymiddle>more data</mymiddle> and <mylast>... elements
and then assembling the new layer by some sort of choice to concatenate
the relevant myfirst, optional mymiddle, and mylast elements for each
item. 
	  
	I think that requires a way to make a choice based on the
<first/>, <middle/>, <last/> elements and populate either a <myfirst>,
<mymiddle>, or <mylast> element (all subtypes of string?) with the
contents of the following data element, which I think we can do in DFDL.
This is just our standard choice flag that decides which of several
options exist. 
	  
	Then, I think you'd need logic to decide how many elements
represent one item, which I think we have, followed by a way to
concatenate these elements to produce a string source, which again I
think we have (same as saying a complex can be built from two floats
referenced from another layer instead of from a float stream). This part
is the same problem as having a text file where one <CR> separates lines
and <CR><CR> separates paragraphs and you want to create single strings
(from a variable number of lines) for each paragraph. 
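
The line/paragraph analogy can be sketched in a few lines of Python. This is only an illustration of the grouping step, not DFDL; the function name `paragraphs` and the choice to join lines with a single space are assumptions.

```python
def paragraphs(text, cr="\r", joiner=" "):
    """Build one string per paragraph: a double separator ends a paragraph,
    a single separator ends a line inside it."""
    result = []
    for chunk in text.split(cr + cr):
        # Drop empty pieces so stray separators do not produce blank lines.
        lines = [line for line in chunk.split(cr) if line]
        if lines:
            result.append(joiner.join(lines))
    return result
```

For instance, `paragraphs("line one\rline two\r\rsecond para")` gives one string for the two-line paragraph and one for the single-line paragraph.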
	  
	Again, I won't argue that this is simple and fun, but I think
the machinery exists and is the same as that from our simple examples. 
	  
	  Jim 
	  
	  
	 -----Original Message-----
	From: owner-dfdl-wg at ggf.org [mailto:owner-dfdl-wg at ggf.org] On
Behalf Of mike.beckerle at ascentialsoftware.com
	Sent: Friday, November 19, 2004 10:44 AM
	To: Myers, James D; dfdl-wg at gridforum.org
	Subject: RE: [dfdl-wg] simple way to study hard DFDL example
problem - IBMFormat VS records as XML
	
	You are thinking along the lines I was; however, the challenge
is that I cannot find a way to do this using multilayer so I'm
uncomfortable suggesting that it's possible at all anymore. Here's some
reasoning why. 
	  
	In particular, it's the intersection of the induction across the
items with the first, middle*, last thing, and the spanning that seems
to defy my efforts to cut it up into progressive transformation layer by
layer. In some conversations I've referred to this problem as the
"non-conforming trees" problem. The fundamental shapes of the trees are
not compatible, and expressing the transformation between them isn't
easily done via induction of any kind on one or the other of the trees. 
	  
	To me the First, Middle*, Last thing is very problematic. It's
effectively a little regular language (in the formal sense) that has to
be recognized. Generally this requires a finite-state-machine, and what
makes FSMs interesting and complex is always the way you diagnose
malformed data in addition to recognizing correct data. 
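
To show what such a recognizer looks like when written procedurally, here is a small Python sketch of a two-state machine for the pattern WHOLE | FIRST MIDDLE* LAST, including the diagnosis of malformed sequences. The tag names come from the original XML example in this thread; the function name `assemble`, the (tag, data) pair representation, and the error messages are assumptions.

```python
def assemble(segments):
    """Recognize WHOLE | FIRST MIDDLE* LAST over (tag, data) pairs and
    return the reassembled items; raise ValueError on malformed input."""
    items = []
    parts = None  # None: between items. A list: inside a spanning item.
    for index, (tag, data) in enumerate(segments):
        if parts is None:
            if tag == "WHOLE":
                items.append(data)
            elif tag == "FIRST":
                parts = [data]
            else:
                raise ValueError(f"segment {index}: {tag} without a preceding FIRST")
        else:
            if tag == "MIDDLE":
                parts.append(data)
            elif tag == "LAST":
                parts.append(data)
                items.append("".join(parts))
                parts = None
            else:
                raise ValueError(f"segment {index}: {tag} inside an unfinished item")
    if parts is not None:
        raise ValueError("input ended inside an unfinished item")
    return items
```

Most of the branches exist only to report bad input, which is the point made above about where the complexity of an FSM lies.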
	  
	Now, a finite-state-machine is, to my mind, the ultimate
procedural abstraction, the quintessential opposite of "declarative"
expression. To be declarative about a FSM you end up saying "recognize
this regular language", and providing a description of the regular
language, which is of course, just begging the question of how it
actually works. 
	  
	(And for us, we're not really talking about a regular language
of character text, but a pattern of usage in the binary data layout that
obeys the pattern of a regular language. So it's not like having a
little regular expression thing for validating text strings helps with
this problem.) 
	  
	I guess I'm arguing that a black box approach to this is not
only acceptable, but is highly likely to be the only "good" way to do
it. In light of this I've suggested a rep property called "streamFormat"
(perhaps should be renamed "recordFormat"), which gets values from the
set VS, V, VBS, FB, FBS, etc. etc. all these well-defined legacy data
formats (there are 19 of them I think).  In addition, one should be able
to extend this by introduction of a blackbox transformation. 
	  
	And ... here's the rub...if that's true for this case, then
other "hard" examples like run-length encoding seem also in this
category.   
	  
	There are several "leaps of faith" just made in these arguments,
so I'd still like people to take this "XML challenge" and see if there's
some magic I'm overlooking. 
	  
	...mikeb 
	  
	  
________________________________

	From: Myers, James D [mailto:jim.myers at pnl.gov] 
	Sent: Friday, November 19, 2004 9:52 AM
	To: dfdl-wg at gridforum.org
	Subject: RE: [dfdl-wg] simple way to study hard DFDL example
problem - IBM Format VS records as XML
	
	Without digging too much into the details, I'd say this is an
example where multi-layer comes in. The DFDL would describe a hidden
layer in which the first, middle, last data elements would be identified
and put into a list, and then that hidden list would be used as the
input to create items in the output layer. 
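
A rough procedural sketch of that two-layer idea in Python, using the sample data from the original message. This is not DFDL and makes no claim about how DFDL would express it; the wrapping `FORMAT_VS` root element, the `ITEMS` output element, and the function names are assumptions added so the example is a single well-formed document.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<FORMAT_VS>
  <BLOCK><SEGMENT><WHOLE/><DATA>The first item</DATA></SEGMENT></BLOCK>
  <BLOCK><SEGMENT><FIRST/><DATA>Thi</DATA></SEGMENT></BLOCK>
  <BLOCK><SEGMENT><MIDDLE/><DATA>s is t</DATA></SEGMENT></BLOCK>
  <BLOCK><SEGMENT><MIDDLE/><DATA>he sec</DATA></SEGMENT></BLOCK>
  <BLOCK>
    <SEGMENT><LAST/><DATA>ond item</DATA></SEGMENT>
    <SEGMENT><WHOLE/><DATA>Third item</DATA></SEGMENT>
  </BLOCK>
</FORMAT_VS>
"""


def hidden_layer(root):
    """Layer 1: drop the BLOCK boundaries, keep (type tag, data) pairs in order."""
    pairs = []
    for segment in root.iter("SEGMENT"):
        tag = segment[0].tag  # the first child is the type tag element
        data = segment.findtext("DATA", default="")
        pairs.append((tag, data))
    return pairs


def items_layer(pairs):
    """Layer 2: concatenate data until a WHOLE or LAST tag closes the item.
    No validation is done here; malformed sequences are not diagnosed."""
    out = ET.Element("ITEMS")
    buffer = []
    for tag, data in pairs:
        buffer.append(data)
        if tag in ("WHOLE", "LAST"):
            ET.SubElement(out, "ITEM").text = "".join(buffer)
            buffer = []
    return out


root = ET.fromstring(SAMPLE.strip())
items = items_layer(hidden_layer(root))
```

`ET.tostring(items, encoding="unicode")` then serializes the output layer as a sequence of `ITEM` elements.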
	  
	I think this is conceptually similar to one of our run-length
encoding examples (more complex of course). If you read a sequence of
ints and then a sequence of floats and need to output a sequence of
floats with int[i] repeats of float[i], it would be easiest to create a
hidden layer representing the int and float sequences and to then
produce output from that. If you don't think about a layer, even this
example gets painful - I need to read an int, skip forward somewhere to
find a float, skip back to get the next int, etc. 
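
A minimal Python sketch of that run-length example, split into the same two steps: first read the int and float sequences into a hidden layer, then produce the output from it. The byte layout (n big-endian int32 values followed by n big-endian float32 values) and the function names are assumptions for illustration.

```python
import struct


def read_hidden_layer(buf, n):
    """Read n int32 counts, then n float32 values (assumed big-endian layout)."""
    counts = struct.unpack(f">{n}i", buf[:4 * n])
    values = struct.unpack(f">{n}f", buf[4 * n:8 * n])
    return list(counts), list(values)


def expand(counts, values):
    """Output layer: emit values[i] repeated counts[i] times."""
    if len(counts) != len(values):
        raise ValueError("counts and values must have the same length")
    out = []
    for count, value in zip(counts, values):
        out.extend([value] * count)
    return out
```

Reading both sequences up front is what avoids the back-and-forth seeking described above.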
	  
	Mike's full example, not starting with the XML-ized version,
might be something that requires more than one layer - read the original
into something with the XML schema Mike defines, then a layer making a
sequence of data elements, and then something that has the desired
logical output. 
	  
	I guess I would claim that this would not be too bad a way to
describe a fairly complex format in terms of a fairly different logical
structure. Whether one *should* do this in DFDL, or whether it would
make more sense to a) write a black box parser to get to items, or b)
use DFDL to get to the initial schema Mike wrote and use XSLT afterwards
to convert to the desired logical structure, is a separate question. I
think there are enough
cases where we need the multilayer functionality in DFDL that are
relatively simple that we have to have it, which means it will then be
possible to deal with complex transformations in DFDL even if not
simple/practical. 
	  
	  Jim 
	  
	 -----Original Message-----
	From: owner-dfdl-wg at ggf.org [mailto:owner-dfdl-wg at ggf.org] On
Behalf Of mike.beckerle at ascentialsoftware.com
	Sent: Thursday, November 18, 2004 9:53 PM
	To: dfdl-wg at gridforum.org
	Subject: [dfdl-wg] simple way to study hard DFDL example problem
- IBM Format VS records as XML
	
	I've come up with a way to articulate the difficulties I'm
having with DFDL for complex file formats. 
	  
	This problem may not be that hard for someone with more XML,
XPath or XQuery experience, so I'd appreciate it if you could look it
over and if necessary even run it by your resident XML experts. 
	  
	In case the emailer mangles all the line lengths, I've also
attached the below as a file. 
	  
	<!-- Example motivated by DFDL for IBM Format-VS -->
	<!-- see http://tinyurl.com/3s2bq for details on IBM Format-VS -->
	  
	<!-- Logically, our data is this: --> 
	  
	<ITEM>The first item</ITEM>
	<ITEM>This is the second item</ITEM>
	<ITEM>Third item</ITEM> 
	  
	<!-- That is, data having this "logical" schema --> 
	  
	<sequence>
	 <element name="ITEM" type="string" minOccurs="0"
maxOccurs="unbounded"/>
	</sequence> 
	  
	<!-- But the below is the input data we're starting from. What
you see below simulates
	    the structural issues of IBM Format-VS, while converting the
problem into an XML to XML
	    transformation problem --> 
	  
	<BLOCK>
	 <SEGMENT>
	   <WHOLE/> <!-- a WHOLE segment holds a whole item (Duh!). This
element is really a type tag. -->
	   <DATA>The first item</DATA>  
	 </SEGMENT>
	</BLOCK> 
	  
	<BLOCK>
	 <SEGMENT>
	   <FIRST/> <!-- a FIRST segment holds the first part of an
item. -->
	   <DATA>Thi</DATA>
	 </SEGMENT>
	</BLOCK> 
	  
	<BLOCK>
	 <SEGMENT>
	   <MIDDLE/> <!-- a MIDDLE segment holds data from the center of
an item -->
	   <DATA>s is t</DATA>
	 </SEGMENT>
	</BLOCK> 
	  
	<BLOCK>
	 <SEGMENT>
	   <MIDDLE/> 
	   <DATA>he sec</DATA>
	 </SEGMENT>
	</BLOCK> 
	  
	<BLOCK>
	 <SEGMENT>
	   <LAST/> <!-- a LAST segment holds data from the end of the
item.  -->
	   <DATA>ond item</DATA>
	 </SEGMENT>
	 <SEGMENT>
	   <WHOLE/><!-- This second segment in this block is a WHOLE
segment. However 
	                in general the 2nd segment of a block could be a
WHOLE or the 
	                FIRST segment of another multi-segment
multi-block spanning item -->
	   <DATA>Third item</DATA>
	 </SEGMENT>
	</BLOCK> 
	  
	<!-- Some observations: -->
	<!-- Data is organized into BLOCKs -->
	<!-- Each block contains 1 or 2 SEGMENTs -->
	<!-- Each SEGMENT is either a WHOLE item, or the item spans 2 or
more SEGMENTs -->
	<!-- Spanning data is broken on arbitrary boundaries across
segments it spans -->
	<!-- Spanning involves a FIRST, MIDDLE*, LAST segment structure.
-->
	<!-- MIDDLE* means zero or more MIDDLE segments. --> 
	  
	<!-- The question: how can we express the transformation into
the desired logical form?
	    Or is this beyond the call of duty for DFDL?
	    Goals include to be as declarative as possible, and ideally,
do it as a set of
	    XML Schema annotations in the GGF DFDL style.  --> 
	  
	<!-- here's an XSD (untested) for the input data structure --> 
	  
	<complexType name="Format_VS_t">
	<sequence>
	  <element name="BLOCK" type="Block_t" minOccurs="0"
maxOccurs="unbounded"/>
	</sequence>
	</complexType> 
	  
	<complexType name="Block_t">
	     <sequence>
	        <element name="SEGMENT" type="Segment_t" minOccurs="1"
maxOccurs="2"/>
	     </sequence>
	</complexType> 
	  
	<complexType name="Segment_t">
	<sequence>
	 <choice>
	   <element name="WHOLE">
	   </element>
	   <element name="FIRST">
	   </element>
	   <element name="LAST">
	   </element>
	   <element name="MIDDLE">
	   </element>
	 </choice>
	 <element name="DATA" type="string"/>
	</sequence>
	</complexType> 
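
Since the XSD above is marked untested, here is a small Python sketch that checks the same structural constraints directly on a parsed document: each BLOCK holds 1 or 2 SEGMENTs, and each SEGMENT holds one type tag followed by DATA. It is a stand-in for schema validation, not a replacement for it; the function name `check_structure` and the assumption that the BLOCKs sit under a single wrapping root element are mine.

```python
import xml.etree.ElementTree as ET

TYPE_TAGS = {"WHOLE", "FIRST", "MIDDLE", "LAST"}


def check_structure(root):
    """Return a list of problems; an empty list means the structure matches."""
    problems = []
    for b, block in enumerate(root.findall("BLOCK")):
        segments = list(block)
        if not 1 <= len(segments) <= 2 or any(s.tag != "SEGMENT" for s in segments):
            problems.append(f"BLOCK {b}: expected 1 or 2 SEGMENT children")
            continue
        for s, segment in enumerate(segments):
            children = [child.tag for child in segment]
            if len(children) != 2 or children[0] not in TYPE_TAGS or children[1] != "DATA":
                problems.append(f"BLOCK {b} SEGMENT {s}: expected a type tag then DATA")
    return problems
```

Note that this only checks the shape of each block; whether the type tags form valid FIRST, MIDDLE*, LAST runs across blocks is a separate question that neither this check nor the XSD addresses.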
	  
	