[dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

Fri Nov 19 10:57:02 CST 2004

I think Jim is right here. I think that this problem is conceptually 
less troublesome than some of the text processing examples that we have 
considered and decided were requirements. You can produce the logical 
structure of this IBM data by adding comments to a very simple file. 
Consider a file with line-separated data:

First item
Second item
Third item

Then add C-style comments:

First item
Sec/* comment
crosses a line */ond/* another
comment */ item
Third item

I think this has the same logical structure as Mike's example. I also 
think we have agreed we need to handle it. Our text processing examples 
get a lot more involved than this.
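
To make the logical structure concrete, here is a minimal sketch in Python 
(the names are mine and nothing here is DFDL). It assumes comments never 
nest: strip them first, then split what is left on line feeds.

```python
import re

# A C-style comment, matched non-greedily and allowed to span line feeds.
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

def logical_items(text):
    """Strip comments first, then split the remaining text on line feeds."""
    return COMMENT.sub("", text).splitlines()

raw = (
    "First item\n"
    "Sec/* comment\n"
    "crosses a line */ond/* another\n"
    "comment */ item\n"
    "Third item\n"
)
items = logical_items(raw)
print(items)
```

The five physical lines collapse back into the three logical items of the 
first file.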

Perhaps this doesn't help, because we know that we can throw in a 
regular expression to parse out the comments and then the problem is 
easy. But suppose we didn't; suppose we split by line feed first and then 
had to compose the middle line. Do we have the machinery to do this?
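
Here is a rough sketch of what the composing step has to do if we split on 
line feeds first. It is plain Python rather than anything DFDL offers, and 
the compose name is made up for illustration. The point is that the step 
must carry one bit of state (are we inside a comment?) from one physical 
line to the next.

```python
def compose(lines):
    """Rebuild logical items from physical lines already split on line feeds.

    A physical line that ends inside an open comment is continued by the
    next physical line, so the in_comment flag survives across lines.
    """
    items, current, in_comment = [], "", False
    for line in lines:
        i = 0
        while i < len(line):
            if in_comment:
                end = line.find("*/", i)
                if end < 0:
                    i = len(line)          # comment runs past this line
                else:
                    in_comment = False
                    i = end + 2
            else:
                start = line.find("/*", i)
                if start < 0:
                    current += line[i:]    # rest of the line is data
                    i = len(line)
                else:
                    current += line[i:start]
                    in_comment = True
                    i = start + 2
        if not in_comment:                 # a real line feed ends the item
            items.append(current)
            current = ""
    if in_comment:
        raise ValueError("unterminated comment")
    return items

lines = [
    "First item",
    "Sec/* comment",
    "crosses a line */ond/* another",
    "comment */ item",
    "Third item",
]
items = compose(lines)
print(items)
```

Whether a declarative description can express that carried state is exactly 
the question.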

Martin

Myers, James D wrote:
> 
>      I was thinking that step 1 involved recognizing the <first/> and
>     <data> elements and creating a sequence of <myfirst>here's the
>     data</myfirst>, <mymiddle>more data</mymiddle> and <mylast>...
>     elements and then assembling the new layer by some sort of choice to
>     concatenate the relevant myfirst, optional mymiddle, and mylast
>     elements for each item.
>      
>     I think that requires a way to make a choice based on the <first/>,
>     <middle/>, <last/> elements and populate either a <myfirst>,
>     <mymiddle>, or <mylast> element (all subtypes of string?) with the
>     contents of the following data element, which I think we can do in
>     DFDL. This is just our standard choice flag that decides which of
>     several options exist.
>      
>     Then, I think you'd need logic to decide how many elements represent
>     one item, which I think we have, followed by a way to concatenate
>     these elements to produce a string source, which again I think we
>     have (same as saying a complex can be built from two floats
>     referenced from another layer instead of from a float stream). This
>     part is the same problem as having a text file where one <CR>
>     separates lines and <CR><CR> separates paragraphs and you want to
>     create single strings (from a variable number of lines) for each
>     paragraph.
>      
>     Again, I won't argue that this is simple and fun, but I think the
>     machinery exists and is the same as that from our simple examples.
>      
>       Jim
>      
>      
>      -----Original Message-----
>     *From:* owner-dfdl-wg at ggf.org [mailto:owner-dfdl-wg at ggf.org] *On
>     Behalf Of *mike.beckerle at ascentialsoftware.com
>     *Sent:* Friday, November 19, 2004 10:44 AM
>     *To:* Myers, James D; dfdl-wg at gridforum.org
>     *Subject:* RE: [dfdl-wg] simple way to study hard DFDL example
>     problem - IBM Format VS records as XML
> 
>     You are thinking along the lines I was; however, the challenge is
>     that I cannot find a way to do this using multilayer so I'm
>     uncomfortable suggesting that it's possible at all anymore. Here's
>     some reasoning why.
>      
>     In particular, it's the intersection of the induction across the
>     items with the first, middle*, last thing, and the spanning that
>     seems to defy my efforts to cut it up into a progressive
>     transformation, layer by layer. In some conversations I've referred
>     to this problem as the "non-conforming trees" problem. The
>     fundamental shapes of the trees are not compatible, and expressing
>     the transformation between them isn't easily done via induction of
>     any kind on one or the other of the trees.
>      
>     To me the First, Middle*, Last thing is very problematic. It's
>     effectively a little regular language (in the formal sense) that has
>     to be recognized. Generally this requires a finite-state-machine,
>     and what makes FSMs interesting and complex is always the way you
>     diagnose malformed data in addition to recognizing correct data.
>      
>     Now, a finite-state-machine is, to my mind, the ultimate procedural
>     abstraction, the quintessential opposite of "declarative"
>     expression. To be declarative about a FSM you end up saying
>     "recognize this regular language", and providing a description of
>     the regular language, which is of course, just begging the question
>     of how it actually works.
>      
>     (And for us, we're not really talking about a regular language of
>     character text, but a pattern of usage in the binary data layout
>     that obeys the pattern of a regular language. So it's not like
>     having a little regular expression thing for validating text strings
>     helps with this problem.)
>      
>     I guess I'm arguing that a black box approach to this is not only
>     acceptable, but is highly likely to be the only "good" way to do it.
>     In light of this I've suggested a rep property called "streamFormat"
>     (perhaps should be renamed "recordFormat"), which gets values from
>     the set VS, V, VBS, FB, FBS, etc.: all these well-defined legacy
>     data formats (there are 19 of them, I think). In addition, one should
>     be able to extend this by introducing a black-box transformation.
>      
>     And ... here's the rub...if that's true for this case, then other
>     "hard" examples like run-length encoding seem also in this category.  
>      
>     There are several "leaps of faith" in these arguments, so
>     I'd still like people to take this "XML challenge" and see if
>     there's some magic I'm overlooking.
>      
>     ...mikeb
>      
>      
> 
>         ------------------------------------------------------------------------
>         *From:* Myers, James D [mailto:jim.myers at pnl.gov]
>         *Sent:* Friday, November 19, 2004 9:52 AM
>         *To:* dfdl-wg at gridforum.org
>         *Subject:* RE: [dfdl-wg] simple way to study hard DFDL example
>         problem - IBM Format VS records as XML
> 
>         Without digging too much into the details, I'd say this is an
>         example where multi-layer comes in. The DFDL would describe a
>         hidden layer in which the first, middle, last data elements
>         would be identified and put into a list, and then that hidden
>         list would be used as the input to create items in the output layer.
>          
>         I think this is conceptually similar to one of our run-length
>         encoding examples (more complex of course). If you read a
>         sequence of ints and then a sequence of floats and need to
>         output a sequence of floats with int[i] repeats of float[i], it
>         would be easiest to create a hidden layer representing the int
>         and float sequences and to then produce output from that. If you
>         don't think about a layer, even this example gets painful - I
>         need to read an int, skip forward somewhere to find a float,
>         skip back to get the next int, etc.
>          
>         Mike's full example, not starting with the XML-ized
>         version, might be something that requires more than one layer -
>         read the original into something with the XML schema Mike
>         defines, then a layer making a sequence of data elements, and
>         then something that has the desired logical output.
>          
>         I guess I would claim that this would not be too bad a way to
>         describe a fairly complex format in terms of a fairly different
>         logical structure. Whether one *should* do this in DFDL is a
>         separate question: it might make more sense to a) write a black
>         box parser to get to items, or b) use DFDL to get to the initial
>         schema Mike wrote and use XSLT afterwards to convert to the desired
>         logical structure. I think there are enough cases where we need
>         the multilayer functionality in DFDL that are relatively simple
>         that we have to have it, which means it will then be possible to
>         deal with complex transformations in DFDL even if not
>         simple/practical.
>          
>           Jim
>          
>          -----Original Message-----
>         *From:* owner-dfdl-wg at ggf.org [mailto:owner-dfdl-wg at ggf.org] *On
>         Behalf Of *mike.beckerle at ascentialsoftware.com
>         *Sent:* Thursday, November 18, 2004 9:53 PM
>         *To:* dfdl-wg at gridforum.org
>         *Subject:* [dfdl-wg] simple way to study hard DFDL example
>         problem - IBM Format VS records as XML
> 
>             I've come up with a way to articulate the difficulties I'm
>             having with DFDL for complex file formats.
>              
>             This problem may not be that hard for someone with more XML,
>             XPath or XQuery experience, so I'd appreciate it if you could
>             look it over and if necessary even run it by your resident
>             XML experts.
>              
>             In case the emailer mangles all the line lengths, I've also
>             attached the below as a file.
>              
>             <!-- Example motivated by DFDL for IBM Format-VS -->
>             <!-- see http://tinyurl.com/3s2bq for details on IBM
>             Format-VS -->
>              
>             <!-- Logically, our data is this: -->
>              
>             <ITEM>The first item</ITEM>
>             <ITEM>This is the second item</ITEM>
>             <ITEM>Third item</ITEM>
>              
>             <!-- That is, data having this "logical" schema -->
>              
>             <sequence>
>               <element name="ITEM" type="string" minOccurs="0"
>             maxOccurs="unbounded"/>
>             </sequence>
>              
>             <!-- But the below is the input data we're starting from.
>                  What you see below simulates the structural issues of
>                  IBM Format-VS, by converting the problem into an XML
>                  to XML transformation problem -->
>              
>             <BLOCK>
>               <SEGMENT>
>                 <WHOLE/> <!-- a WHOLE segment holds a whole item (Duh!).
>             This element is really a type tag. -->
>                 <DATA>The first item</DATA> 
>               </SEGMENT>
>             </BLOCK>
>              
>             <BLOCK>
>               <SEGMENT>
>                 <FIRST/> <!-- a FIRST segment holds the first part of an
>             item. -->
>                 <DATA>Thi</DATA>
>               </SEGMENT>
>             </BLOCK>
>              
>             <BLOCK>
>               <SEGMENT>
>                 <MIDDLE/> <!-- a MIDDLE segment holds data from the
>             center of an item -->
>                 <DATA>s is t</DATA>
>               </SEGMENT>
>             </BLOCK>
>              
>             <BLOCK>
>               <SEGMENT>
>                 <MIDDLE/>
>                 <DATA>he sec</DATA>
>               </SEGMENT>
>             </BLOCK>
>              
>             <BLOCK>
>               <SEGMENT>
>                 <LAST/> <!-- a LAST segment holds data from the end of
>             the item.  -->
>                 <DATA>ond item</DATA>
>               </SEGMENT>
>               <SEGMENT>
>                 <WHOLE/><!-- This second segment in this block is a
>             WHOLE segment. However
>                              in general the 2nd segment of a block could
>             be a WHOLE or the
>                              FIRST segment of another multi-segment
>             multi-block spanning item -->
>                 <DATA>Third item</DATA>
>               </SEGMENT>
>             </BLOCK>
>              
>             <!-- Some observations: -->
>             <!-- Data is organized into BLOCKs -->
>             <!-- Each block contains 1 or 2 SEGMENTs -->
>             <!-- Each SEGMENT is either a WHOLE item, or the item spans
>             2 or more SEGMENTs -->
>             <!-- Spanning data is broken on arbitrary boundaries across
>             segments it spans -->
>             <!-- Spanning involves a FIRST, MIDDLE*, LAST segment
>             structure. -->
>             <!-- MIDDLE* means zero or more MIDDLE segments. -->
>              
>             <!-- The question: how can we express the transformation
>                  into the desired logical form?
>                  Or is this beyond the call of duty for DFDL?
>                  Goals include being as declarative as possible and,
>                  ideally, doing it as a set of XML Schema annotations
>                  in the GGF DFDL style.  -->
>              
>             <!-- here's an XSD (untested) for the input data structure -->
>              
>             <complexType name="Format_VS_t">
>              <sequence>
>                <element name="BLOCK" type="Block_t" minOccurs="0"
>             maxOccurs="unbounded"/>
>              </sequence>
>             </complexType>
>              
>             <complexType name="Block_t">
>                   <sequence>
>                      <element name="SEGMENT" type="Segment_t"
>             minOccurs="1" maxOccurs="2"/>
>                   </sequence>
>             </complexType>
>              
>             <complexType name="Segment_t">
>              <sequence>
>               <choice>
>                 <element name="WHOLE">
>                 </element>
>                 <element name="FIRST">
>                 </element>
>                 <element name="LAST">
>                 </element>
>                 <element name="MIDDLE">
>                 </element>
>               </choice>
>               <element name="DATA" type="string"/>
>              </sequence>
>             </complexType>
>              
> 
>
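
For what it's worth, the FIRST, MIDDLE*, LAST recognition Mike describes in 
the quoted message is small when written procedurally. Below is a rough 
Python sketch, not a claim about how DFDL would express it. It assumes the 
BLOCKs are wrapped in a single root element; the FORMAT_VS root name and 
the function name are mine, not from the thread.

```python
import xml.etree.ElementTree as ET

def items_from_blocks(root):
    """Walk SEGMENTs in document order and reassemble the logical items.

    parts is None while no spanning item is open; otherwise it holds the
    DATA pieces collected since the FIRST segment.
    """
    items, parts = [], None
    for seg in root.iter("SEGMENT"):
        tag = seg[0].tag                    # the type-tag child comes first
        data = seg.findtext("DATA", default="")
        if tag == "WHOLE" and parts is None:
            items.append(data)
        elif tag == "FIRST" and parts is None:
            parts = [data]
        elif tag == "MIDDLE" and parts is not None:
            parts.append(data)
        elif tag == "LAST" and parts is not None:
            parts.append(data)
            items.append("".join(parts))
            parts = None
        else:
            raise ValueError("unexpected %s segment" % tag)
    if parts is not None:
        raise ValueError("input ended inside a spanning item")
    return items

SAMPLE = """<FORMAT_VS>
<BLOCK><SEGMENT><WHOLE/><DATA>The first item</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><FIRST/><DATA>Thi</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><MIDDLE/><DATA>s is t</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><MIDDLE/><DATA>he sec</DATA></SEGMENT></BLOCK>
<BLOCK>
  <SEGMENT><LAST/><DATA>ond item</DATA></SEGMENT>
  <SEGMENT><WHOLE/><DATA>Third item</DATA></SEGMENT>
</BLOCK>
</FORMAT_VS>"""

items = items_from_blocks(ET.fromstring(SAMPLE))
print(items)
```

Most of the branches exist to reject malformed sequences (a MIDDLE with no 
FIRST, a WHOLE inside an open item), which is the diagnostic burden Mike 
points out for finite-state machines.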