Fw: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

Fri Nov 19 14:00:04 CST 2004

Jim -- I agree with most of your assertions, and you have phrased it right:
"relatively compliant with physical structure".  Some examples from
programming languages would be the COBOL "OCCURS DEPENDING ON" clause
and, as you mentioned in the example, "a previous value in the structure
indicating which field in the choice will be present or how many
occurrences a subsequent field will have", etc.  These are the most
common kinds of constructs, and they occur quite frequently in programming
structures.

I think the DFDL standard is addressing a very critical requirement,
"rendering a logical structure to a relatively compliant physical format
and vice versa", which no other public standard has addressed so far, to my
knowledge, and this work is/will be very complementary to other
standards.

Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools 
Tel : 905-413-3923  T/L  969-3923
Fax : 905-413-4850
Internet ID : kalia at ca.ibm.com 
----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 02:05 PM ----- 
"Myers, James D" <jim.myers at pnl.gov> 
Sent by: owner-dfdl-wg at ggf.org 
11/19/2004 12:34 PM 

To: dfdl-wg at gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

Unfortunately, there's a slippery slope here - there are no ints on the 
disk, just logical ones and zeros that you can transform into a second 
logical structure composed of ints, assuming you specify byte order. I 
think we have a whole stream of examples beyond that - removing 
delimiters, using a length prefix to define the length of a subsequent 
structure, etc. - that we see as minor transformations to something still 
relatively "compliant" with the physical structure, but, I believe, 
require the same machinery as things I think we will all agree are beyond 
the scope of what DFDL should aim for. 
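
A minimal illustrative sketch of the two "minor transformations" named
above, in Python. It only assumes the standard `struct` module; the byte
values, the 2-byte big-endian length prefix, and the helper name
`read_length_prefixed` are hypothetical choices made for the example, not
anything defined by DFDL.

```python
import struct

# Four raw bytes as they might sit on disk (illustrative values).
raw = bytes([0x00, 0x00, 0x01, 0x02])

# The same bits give different logical ints depending on the declared
# byte order, which is why the byte order has to be specified.
big = struct.unpack(">i", raw)[0]     # read as big-endian
little = struct.unpack("<i", raw)[0]  # read as little-endian


def read_length_prefixed(buf, offset=0):
    """Read a 2-byte big-endian length, then that many payload bytes.

    Returns (payload, next_offset). The prefix width is an assumption.
    """
    (length,) = struct.unpack_from(">H", buf, offset)
    start = offset + 2
    end = start + length
    if end > len(buf):
        raise ValueError("length prefix runs past end of buffer")
    return buf[start:end], end


# Two length-prefixed fields back to back.
stream = b"\x00\x05hello\x00\x03abc"
first, pos = read_length_prefixed(stream)
second, pos = read_length_prefixed(stream, pos)
```

Even this small reader already lets an earlier value (the length) decide
how the following bytes are interpreted, which is the machinery in
question.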

In practice, I think people should get out of DFDL as soon as possible 
just as you say - use other technologies once you get an initial 
structure. But I think there are cases where you have to stay in DFDL - 
anything where I have to transform the initial physically-compliant 
structure to interpret subsequent fields - x and y ints tell me how many 
pixel repeats, an int greater than another int read previously implies a 
different subsequent structure, etc. And again, the minimal machinery to 
do that lets you go farther than you'd want people to go in practice. 
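
The two cases above can be sketched in Python as follows. The layouts are
assumptions made up for illustration (two big-endian 16-bit header values,
one byte per pixel, a float-or-int payload), and `parse_image` and
`parse_record` are hypothetical names.

```python
import struct


def parse_image(buf):
    """Header: two big-endian uint16 values x and y; body: x*y one-byte pixels."""
    x, y = struct.unpack_from(">HH", buf, 0)
    count = x * y  # the earlier fields decide how many pixels follow
    pixels = buf[4:4 + count]
    if len(pixels) != count:
        raise ValueError(f"expected {count} pixels, found {len(pixels)}")
    return x, y, list(pixels)


def parse_record(buf):
    """Two uint16 values a and b, then a payload whose type depends on them.

    If b > a the payload is a 4-byte float, otherwise a 2-byte signed int.
    """
    a, b = struct.unpack_from(">HH", buf, 0)
    if b > a:
        (value,) = struct.unpack_from(">f", buf, 4)
        return ("float", value)
    (value,) = struct.unpack_from(">h", buf, 4)
    return ("int", value)
```

Both functions have to compute on values already read (a product, a
comparison) before they can interpret what follows, so the description
cannot be purely positional.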

There may also be reasonable use cases where the ability to stay in DFDL 
is important. For example, take digital preservation, where I might want 
to map all document files to a standardized schema, regardless of whether 
it was word, pdf, etc. Being able to specify the full descriptions in one 
file that then requires only one parser to interpret all formats *might* 
be worth the cost to do complex things in DFDL. I don't think our goal for 
a version 1 should be to support such use, but I don't think we can meet 
our simple goals without 'accidentally' making it possible. 

I'd be happy to be proved wrong - seems like a deep point that would be 
cool to understand. I'm not sure how we get to a 'proof' though - we're 
trying to prove that there exists something DFDL as currently formulated 
can't describe. So - we either need to find that example or turn to some 
sort of logic formalism to discover what primitive(s) we're missing that 
keep us from emulating some class of parser/programming. (Or find something 
in DFDL that we don't need to support the examples we do want to 
target...). 

  Jim 

 -----Original Message-----
From: owner-dfdl-wg at ggf.org [mailto:owner-dfdl-wg at ggf.org] On Behalf Of 
Suman Kalia
Sent: Friday, November 19, 2004 11:50 AM
To: dfdl-wg at gridforum.org
Subject: Fw: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

I tend to agree that there are 2 inherent logical structures in this
scenario.  DFDL scope, in my opinion, should be restricted to parsing the
physical stream and populating the logical structure which is compliant
with the structure of the physical stream, and vice versa.  We have numerous
options and technologies (XSLT, XSD<->XSD mappers, good old programming
languages, XQuery) which do a pretty good job of transforming one logical
structure to another logical structure.  Building some kind of annotations
which would allow a physical stream to map to a completely different
logical structure will make the DFDL language very complex.

Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools 
Tel : 905-413-3923  T/L  969-3923
Fax : 905-413-4850
Internet ID : kalia at ca.ibm.com 
----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 11:36 AM ----- 
"Myers, James D" <jim.myers at pnl.gov> 
Sent by: owner-dfdl-wg at ggf.org 
11/19/2004 11:05 AM 

To: dfdl-wg at gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

I was thinking that step 1 involved recognizing the <first/> and <data> 
elements and creating a sequence of <myfirst>here's the data</myfirst>, 
<mymiddle>more data</mymiddle> and <mylast>... elements and then 
assembling the new layer by some sort of choice to concatenate the 
relevant myfirst, optional mymiddle, and mylast elements for each item. 

I think that requires a way to make a choice based on the <first/>, 
<middle/>, <last/> elements and populate either a <myfirst>, <mymiddle>, 
or <mylast> element (all subtypes of string?) with the contents of the 
following data element, which I think we can do in DFDL. This is just our 
standard choice flag that decides which of several options exist. 

Then, I think you'd need logic to decide how many elements represent one 
item, which I think we have, followed by a way to concatenate these 
elements to produce a string source, which again I think we have (same as 
saying a complex can be built from two floats referenced from another 
layer instead of from a float stream). This part is the same problem as 
having a text file where one <CR> separates lines and <CR><CR> separates 
paragraphs and you want to create single strings (from a variable number 
of lines) for each paragraph. 
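
The paragraph analogy can be sketched in a few lines of Python. This is
only an illustration of the grouping step: the function name `paragraphs`
and the choice to join lines with a single space are assumptions.

```python
def paragraphs(text, sep="\r"):
    """Group lines into paragraphs.

    One separator ends a line; two in a row end a paragraph. Each
    paragraph is returned as a single string built from its lines.
    """
    result = []
    current = []  # lines collected for the paragraph in progress
    for line in text.split(sep):
        if line:
            current.append(line)
        elif current:
            # An empty piece means two separators in a row: close the paragraph.
            result.append(" ".join(current))
            current = []
    if current:
        result.append(" ".join(current))
    return result


sample = "line one\rline two\r\rsecond para\r"
grouped = paragraphs(sample)
```

The variable number of lines per paragraph plays the same role as the
variable number of middle segments per item.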

Again, I won't argue that this is simple and fun, but I think the 
machinery exists and is the same as that from our simple examples. 

 Jim 

-----Original Message-----
From: owner-dfdl-wg at ggf.org [mailto:owner-dfdl-wg at ggf.org] On Behalf Of 
mike.beckerle at ascentialsoftware.com
Sent: Friday, November 19, 2004 10:44 AM
To: Myers, James D; dfdl-wg at gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

You are thinking along the lines I was; however, the challenge is that I 
cannot find a way to do this using multilayer so I'm uncomfortable 
suggesting that it's possible at all anymore. Here's some reasoning why. 

In particular, it's the intersection of the induction across the items 
with the first, middle*, last thing, and the spanning that seems to defy 
my efforts to cut it up into progressive transformation layer by layer. In 
some conversations I've referred to this problem as the "non-conforming 
trees" problem. The fundamental shapes of the trees are not compatible, 
and expressing the transformation between them isn't easily done via 
induction of any kind on one or the other of the trees. 

To me the First, Middle*, Last thing is very problematic. It's effectively 
a little regular language (in the formal sense) that has to be recognized. 
Generally this requires a finite-state-machine, and what makes FSMs 
interesting and complex is always the way you diagnose malformed data in 
addition to recognizing correct data. 

Now, a finite-state-machine is, to my mind, the ultimate procedural 
abstraction, the quintessential opposite of "declarative" expression. To 
be declarative about a FSM you end up saying "recognize this regular 
language", and providing a description of the regular language, which is 
of course, just begging the question of how it actually works. 

(And for us, we're not really talking about a regular language of 
character text, but a pattern of usage in the binary data layout that 
obeys the pattern of a regular language. So it's not like having a little 
regular expression thing for validating text strings helps with this 
problem.) 
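
To make the finite-state-machine point concrete, here is a small Python
sketch of a recognizer for the pattern (WHOLE | FIRST MIDDLE* LAST)*. The
state names, the table layout, and the function `check_segments` are
hypothetical; the error branches show where the diagnosis of malformed
data comes in.

```python
# Transition table: state -> {segment tag -> next state}.
TRANSITIONS = {
    "between_items": {"WHOLE": "between_items", "FIRST": "inside_item"},
    "inside_item": {"MIDDLE": "inside_item", "LAST": "between_items"},
}


def check_segments(tags):
    """Return None if tags match (WHOLE | FIRST MIDDLE* LAST)*.

    Otherwise return a message describing the first problem found.
    """
    state = "between_items"
    for index, tag in enumerate(tags):
        next_state = TRANSITIONS[state].get(tag)
        if next_state is None:
            return f"segment {index}: unexpected {tag} while {state}"
        state = next_state
    if state != "between_items":
        return "input ended in the middle of a spanned item"
    return None
```

The table itself is a declarative description of the regular language,
but the loop that walks it is procedural, which is the tension described
above.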

I guess I'm arguing that a black box approach to this is not only 
acceptable, but is highly likely to be the only "good" way to do it. In 
light of this I've suggested a rep property called "streamFormat" (perhaps 
should be renamed "recordFormat"), which gets values from the set VS, V, 
VBS, FB, FBS, etc. etc. all these well-defined legacy data formats (there 
are 19 of them I think).  In addition, one should be able to extend this by 
introducing a black-box transformation. 

And ... here's the rub...if that's true for this case, then other "hard" 
examples like run-length encoding seem also in this category.   

There are several "leaps of faith" in these arguments, so I'd 
still like people to take this "XML challenge" and see if there's some 
magic I'm overlooking. 

...mikeb 

From: Myers, James D [mailto:jim.myers at pnl.gov] 
Sent: Friday, November 19, 2004 9:52 AM
To: dfdl-wg at gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

Without digging too much into the details, I'd say this is an example 
where multi-layer comes in. The DFDL would describe a hidden layer in 
which the first, middle, last data elements would be identified and put 
into a list, and then that hidden list would be used as the input to 
create items in the output layer. 

I think this is conceptually similar to one of our run-length encoding 
examples (more complex of course). If you read a sequence of ints and then 
a sequence of floats and need to output a sequence of floats with int[i] 
repeats of float[i], it would be easiest to create a hidden layer 
representing the int and float sequences and to then produce output from 
that. If you don't think about a layer, even this example gets painful - I 
need to read an int, skip forward somewhere to find a float, skip back to 
get the next int, etc. 
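
A minimal Python sketch of the run-length example, with the hidden layer
made explicit. The binary layout (n big-endian int32 counts followed by n
big-endian float32 values) and the name `decode_rle` are assumptions for
illustration only.

```python
import struct


def decode_rle(buf, n):
    """Decode n counts followed by n values into a repeated-value list.

    Returns (hidden, output): the hidden layer of (count, value) pairs
    and the logical output built from it.
    """
    counts = struct.unpack_from(f">{n}i", buf, 0)
    values = struct.unpack_from(f">{n}f", buf, 4 * n)

    # Hidden layer: both sequences are read once and paired up, so there
    # is no need to skip back and forth in the stream.
    hidden = list(zip(counts, values))

    output = []
    for count, value in hidden:
        output.extend([value] * count)  # int[i] repeats of float[i]
    return hidden, output


data = struct.pack(">3i3f", 2, 0, 3, 1.5, 2.5, -0.25)
hidden, output = decode_rle(data, 3)
```

Here `hidden` is [(2, 1.5), (0, 2.5), (3, -0.25)] and the output repeats
each value by its count.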

Mike's full example, not starting with the XML-ized version, might be 
something that requires more than one layer - read the original into 
something with the XML schema Mike defines, then a layer making a 
sequence of data elements, and then something that has the desired logical 
output. 

I guess I would claim that this would not be too bad a way to describe a 
fairly complex format in terms of a fairly different logical structure. 
Whether one *should* do this in DFDL is a separate question: it might make
more sense to a) write a black box parser to get to items, or b) use DFDL
to get to the initial schema Mike wrote and use XSLT afterwards to convert
to the desired logical structure. I think there are enough cases where we need 
the multilayer functionality in DFDL that are relatively simple that we 
have to have it, which means it will then be possible to deal with complex 
transformations in DFDL even if not simple/practical. 

 Jim 

-----Original Message-----
From: owner-dfdl-wg at ggf.org [mailto:owner-dfdl-wg at ggf.org] On Behalf Of 
mike.beckerle at ascentialsoftware.com
Sent: Thursday, November 18, 2004 9:53 PM
To: dfdl-wg at gridforum.org
Subject: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

I've come up with a way to articulate the difficulties I'm having with 
DFDL for complex file formats. 

This problem may not be that hard for someone with more XML, XPath or 
XQuery experience, so I'd appreciate it if you could look it over and if 
necessary even run it by your resident XML experts. 

In case the emailer mangles all the line lengths, I've also attached the 
below as a file. 

<!-- Example motivated by DFDL for IBM Format-VS -->
<!-- see http://tinyurl.com/3s2bq for details on IBM Format-VS --> 

<!-- Logically, our data is this: --> 

<ITEM>The first item</ITEM>
<ITEM>This is the second item</ITEM>
<ITEM>The third</ITEM> 

<!-- That is, data having this "logical" schema --> 

<sequence>
<element name="ITEM" type="string" minOccurs="0" maxOccurs="unbounded"/>
</sequence> 

<BLOCK>
<SEGMENT>
  <WHOLE/> 
  <DATA>The first item</DATA>  
</SEGMENT>
</BLOCK>

<BLOCK>
<SEGMENT>
  <MIDDLE/> <!-- a MIDDLE segment holds data from the center of an item -->
  <DATA>s is t</DATA>
</SEGMENT>
</BLOCK> 

<BLOCK>
<SEGMENT>
  <MIDDLE/> 
  <DATA>he sec</DATA>
</SEGMENT>
</BLOCK> 

<BLOCK>
<SEGMENT>
  <LAST/> <!-- a LAST segment holds data from the end of the item.  -->
  <DATA>ond item</DATA>
</SEGMENT>
<SEGMENT>
  <WHOLE/><!-- This second segment in this block is a WHOLE segment.
               However, in general the 2nd segment of a block could be a
               WHOLE or the FIRST segment of another multi-segment,
               multi-block spanning item -->
  <DATA>Third item</DATA>
</SEGMENT>
</BLOCK> 

<!-- here's an XSD (untested) for the input data structure --> 

<complexType name="Format_VS_t">
<sequence>
 <element name="BLOCK" type="Block_t" minOccurs="0" maxOccurs="unbounded"/>
</sequence>
</complexType> 

<complexType name="Block_t">
    <sequence>
       <element name="SEGMENT" type="Segment_t" minOccurs="1" maxOccurs="2"/>
    </sequence>
</complexType> 

<complexType name="Segment_t">
<sequence>
<choice>
  <element name="WHOLE">
  </element>
  <element name="FIRST">
  </element>
  <element name="LAST">
  </element>
  <element name="MIDDLE">
  </element>
</choice>
<element name="DATA" type="string"/>
</sequence>
</complexType> 
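
For comparison, here is a procedural ("black box") sketch of the
transformation in Python, using the standard `xml.etree.ElementTree`
module. Several things in it are assumptions rather than part of the
example above: the `<FILE>` wrapper element (added so the blocks form one
well-formed document), the FIRST segment holding "Thi" (the sample data as
shown has no FIRST segment), and the function name `reassemble`.

```python
import xml.etree.ElementTree as ET

# Sample input. The <FILE> root and the FIRST segment are assumptions.
SAMPLE = """
<FILE>
  <BLOCK>
    <SEGMENT><WHOLE/><DATA>The first item</DATA></SEGMENT>
    <SEGMENT><FIRST/><DATA>Thi</DATA></SEGMENT>
  </BLOCK>
  <BLOCK>
    <SEGMENT><MIDDLE/><DATA>s is t</DATA></SEGMENT>
  </BLOCK>
  <BLOCK>
    <SEGMENT><MIDDLE/><DATA>he sec</DATA></SEGMENT>
  </BLOCK>
  <BLOCK>
    <SEGMENT><LAST/><DATA>ond item</DATA></SEGMENT>
    <SEGMENT><WHOLE/><DATA>Third item</DATA></SEGMENT>
  </BLOCK>
</FILE>
"""


def reassemble(xml_text):
    """Turn BLOCK/SEGMENT input into the list of logical ITEM strings."""
    root = ET.fromstring(xml_text)
    items = []
    parts = None  # None means no spanned item is currently open
    # Segments are visited in document order; block boundaries are ignored.
    for segment in root.iter("SEGMENT"):
        kind = segment[0].tag  # WHOLE, FIRST, MIDDLE or LAST
        data = segment.findtext("DATA", default="")
        if kind == "WHOLE" and parts is None:
            items.append(data)
        elif kind == "FIRST" and parts is None:
            parts = [data]
        elif kind == "MIDDLE" and parts is not None:
            parts.append(data)
        elif kind == "LAST" and parts is not None:
            parts.append(data)
            items.append("".join(parts))
            parts = None
        else:
            raise ValueError(f"unexpected {kind} segment")
    if parts is not None:
        raise ValueError("input ended inside a spanned item")
    return items


items = reassemble(SAMPLE)
```

The loop is short precisely because it flattens the block tree and keeps
explicit state in `parts`; the open question in this thread is whether
the same result can be described declaratively.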
