Fw: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

Fri Nov 19 10:17:27 CST 2004

I agree with Jim that two DFDL layers are required, one that describes the
original logical structure and one to describe the desired logical
structure. The key thing to recognise is that there are two logical
structures here, and that a transformation of some kind (XSL, Java program,
...) is required to get one from the other.

I don't think we should get DFDL to treat IBM Format VS records as a purely
physical representation of some ideal logical structure - that gets way too
complicated and imposes a big burden on all DFDL implementations.

This is a pretty subjective area - it poses the philosophical question
"when does the physical format become so cryptic that it can be viewed as
changing the logical structure itself?".

A structure that asks the same question is an IMS segment. These impose
themselves on the data such that the data is carved into segments that are
preceded with an LLZZ field, the LL containing the segment length. Do you
view the logical structure as  a sequence of segments, or do you view it as
the content of the segments where the owning segment # is a physical
property of each field?  On a project I worked on in the past, we took the
latter view, which meant that this IMS specific concept found its way into
the physical model, and we had to write specific code to parse & write
segments. I am not convinced that was the right decision.

Mike, you say you are aware of 19 such legacy formats, and I bet there are
more. Well, IBM's broker has no specific support for any of these, nor have
we been asked to incorporate them into our message model. Maybe we should
play the percentages game - if we see enough different subsystems that use
the same cryptic format then it becomes worth building the support into
DFDL.

Regards, Steve

Steve Hanson
WebSphere Business Integration Brokers,
IBM Hursley, England
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 19/11/2004 16:13 -----

From: mike.beckerle at ascentialsoftware.com
Sent by: owner-dfdl-wg at ggf.org
Sent: 19/11/2004 15:43
To: jim.myers at pnl.gov, dfdl-wg at gridforum.org
cc:
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM
Format VS records as XML

You are thinking along the lines I was; however, the challenge is that I
cannot find a way to do this using multilayer so I'm uncomfortable
suggesting that it's possible at all anymore. Here's some reasoning why.

In particular, it's the intersection of the induction across the items with
the first, middle*, last thing, and the spanning that seems to defy my
efforts to cut it up into progressive transformation layer by layer. In
some conversations I've referred to this problem as the "non-conforming
trees" problem. The fundamental shapes of the trees are not compatible, and
expressing the transformation between them isn't easily done via induction
of any kind on one or the other of the trees.

To me the First, Middle*, Last thing is very problematic. It's effectively
a little regular language (in the formal sense) that has to be recognized.
Generally this requires a finite-state-machine, and what makes FSMs
interesting and complex is always the way you diagnose malformed data in
addition to recognizing correct data.

Now, a finite-state-machine is, to my mind, the ultimate procedural
abstraction, the quintessential opposite of "declarative" expression. To be
declarative about a FSM you end up saying "recognize this regular
language", and providing a description of the regular language, which is of
course, just begging the question of how it actually works.

(And for us, we're not really talking about a regular language of character
text, but a pattern of usage in the binary data layout that obeys the
pattern of a regular language. So it's not like having a little regular
expression thing for validating text strings helps with this problem.)
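
To make the First, Middle*, Last point concrete, here is a minimal Python
sketch of such a recognizer. It assumes the segments have already been reduced
to (tag, data) pairs; the names State, MalformedSegments and reassemble are
hypothetical and are not part of any DFDL proposal. Note that roughly half of
the code deals with malformed input rather than with correct data.

```python
from enum import Enum, auto


class State(Enum):
    BETWEEN_ITEMS = auto()  # no item is currently open
    IN_ITEM = auto()        # a FIRST was seen and its LAST has not arrived yet


class MalformedSegments(ValueError):
    """The tag sequence does not match (WHOLE | FIRST MIDDLE* LAST)*."""


def reassemble(segments):
    """Turn an iterable of (tag, data) pairs into a list of complete items."""
    state = State.BETWEEN_ITEMS
    parts = []   # pieces of the item currently being assembled
    items = []   # completed items, in input order
    for index, (tag, data) in enumerate(segments):
        if state is State.BETWEEN_ITEMS:
            if tag == "WHOLE":
                items.append(data)
            elif tag == "FIRST":
                parts = [data]
                state = State.IN_ITEM
            else:
                raise MalformedSegments(
                    f"segment {index}: unexpected {tag} between items")
        else:  # State.IN_ITEM
            if tag == "MIDDLE":
                parts.append(data)
            elif tag == "LAST":
                parts.append(data)
                items.append("".join(parts))
                state = State.BETWEEN_ITEMS
            else:
                raise MalformedSegments(
                    f"segment {index}: unexpected {tag} inside an item")
    if state is State.IN_ITEM:
        raise MalformedSegments("input ended inside an unfinished item")
    return items
```

Calling reassemble on the pairs WHOLE, FIRST, MIDDLE, MIDDLE, LAST, WHOLE
yields three items, while a stray MIDDLE or a missing LAST raises
MalformedSegments. This is procedural code, which is the point being argued.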

I guess I'm arguing that a black box approach to this is not only
acceptable, but is highly likely to be the only "good" way to do it. In
light of this I've suggested a rep property called "streamFormat" (perhaps
should be renamed "recordFormat"), which takes its value from the set VS, V,
VBS, FB, FBS, etc., covering all these well-defined legacy data formats (there
are 19 of them, I think). In addition, one should be able to extend this by
introducing a black-box transformation.

And ... here's the rub...if that's true for this case, then other "hard"
examples like run-length encoding seem also in this category.

There are several "leaps of faith" just made in these arguments, so I'd still
like people to take this "XML challenge" and see if there's some magic I'm
overlooking.

...mikeb

 From: Myers, James D [mailto:jim.myers at pnl.gov]
 Sent: Friday, November 19, 2004 9:52 AM
 To: dfdl-wg at gridforum.org
 Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM
 Format VS records as XML

 Without digging too much into the details, I'd say this is an example
 where multi-layer comes in. The DFDL would describe a hidden layer in
 which the first, middle, last data elements would be identified and put
 into a list, and then that hidden list would be used as the input to
 create items in the output layer.

 I think this is conceptually similar to one of our run-length encoding
 examples (more complex of course). If you read a sequence of ints and then
 a sequence of floats and need to output a sequence of floats with int[i]
 repeats of float[i], it would be easiest to create a hidden layer
 representing the int and float sequences and to then produce output from
 that. If you don't think about a layer, even this example gets painful - I
 need to read an int, skip forward somewhere to find a float, skip back to
 get the next int, etc.
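
The run-length example above can be sketched in Python as two steps: build
the hidden layer (the int and float sequences), then produce the output from
it. This is only an illustration under stated assumptions: the input is taken
to be a flat list holding count ints followed by count floats, with count
known in advance, and the function names are hypothetical.

```python
def read_hidden_layer(values, count):
    """Hidden layer: split a flat stream into `count` ints then `count` floats."""
    if len(values) < 2 * count:
        raise ValueError("stream is too short for the declared count")
    ints = [int(v) for v in values[:count]]
    floats = [float(v) for v in values[count:2 * count]]
    return ints, floats


def expand(ints, floats):
    """Output layer: emit floats[i] repeated ints[i] times, in order."""
    if len(ints) != len(floats):
        raise ValueError("int and float sequences must have the same length")
    output = []
    for repeat, value in zip(ints, floats):
        output.extend([value] * repeat)
    return output
```

With the hidden layer in place, the output step is a single pass over the two
sequences and needs no skipping back and forth in the input.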

 Mike's full example, not starting with the XML-ized version, might be
 something that requires more than one layer - read the original into
 something with the XML schema Mike defines, then a layer making a
 sequence of data elements, and then something that has the desired logical
 output.

 I guess I would claim that this would not be too bad a way to describe a
 fairly complex format in terms of a fairly different logical structure.
 Whether one *should* do this in DFDL is a separate question; it might make
 more sense to a) write a black box parser to get to items, or b) use DFDL
 to get to the initial schema Mike wrote and use XSLT afterwards to convert
 to the desired logical structure. I think there are enough cases where we need
 the multilayer functionality in DFDL that are relatively simple that we
 have to have it, which means it will then be possible to deal with complex
 transformations in DFDL even if not simple/practical.

   Jim

  -----Original Message-----
 From: owner-dfdl-wg at ggf.org [mailto:owner-dfdl-wg at ggf.org] On Behalf Of
 mike.beckerle at ascentialsoftware.com
 Sent: Thursday, November 18, 2004 9:53 PM
 To: dfdl-wg at gridforum.org
 Subject: [dfdl-wg] simple way to study hard DFDL example problem - IBM
 Format VS records as XML

 I've come up with a way to articulate the difficulties I'm having with
 DFDL for complex file formats.

 This problem may not be that hard for someone with more XML, XPath or
 XQuery experience, so I'd appreciate it if you could look it over and if
 necessary even run it by your resident XML experts.

 In case the emailer mangles all the line lengths, I've also attached the
 below as a file.

 <!-- Example motivated by DFDL for IBM Format-VS -->
 <!-- see http://tinyurl.com/3s2bq for details on IBM Format-VS -->

 <!-- Logically, our data is this: -->

 <ITEM>The first item</ITEM>
 <ITEM>This is the second item</ITEM>
 <ITEM>Third item</ITEM>

 <!-- That is, data having this "logical" schema -->

 <sequence>
   <element name="ITEM" type="string" minOccurs="0" maxOccurs="unbounded"/>
 </sequence>

 <!-- But the below is the input data we're starting from. What you see below
      simulates the structural issues of IBM Format-VS, by converting the
      problem into an XML to XML transformation problem -->

 <BLOCK>
   <SEGMENT>
     <WHOLE/> <!-- a WHOLE segment holds a whole item (Duh!). This element
                   is really a type tag. -->
     <DATA>The first item</DATA>
   </SEGMENT>
 </BLOCK>

 <BLOCK>
   <SEGMENT>
     <FIRST/> <!-- a FIRST segment holds the first part of an item. -->
     <DATA>Thi</DATA>
   </SEGMENT>
 </BLOCK>

 <BLOCK>
   <SEGMENT>
     <MIDDLE/> <!-- a MIDDLE segment holds data from the center of an item -->
     <DATA>s is t</DATA>
   </SEGMENT>
 </BLOCK>

 <BLOCK>
   <SEGMENT>
     <MIDDLE/>
     <DATA>he sec</DATA>
   </SEGMENT>
 </BLOCK>

 <BLOCK>
   <SEGMENT>
     <LAST/> <!-- a LAST segment holds data from the end of the item.  -->
     <DATA>ond item</DATA>
   </SEGMENT>
   <SEGMENT>
     <WHOLE/><!-- This second segment in this block is a WHOLE segment. However
                  in general the 2nd segment of a block could be a WHOLE or the
                  FIRST segment of another multi-segment multi-block spanning
                  item -->
     <DATA>Third item</DATA>
   </SEGMENT>
 </BLOCK>

 <!-- The question: how can we express the transformation into the desired
      logical form? Or is this beyond the call of duty for DFDL?
      Goals include being as declarative as possible and, ideally, doing it
      as a set of XML Schema annotations in the GGF DFDL style. -->

 <!-- here's an XSD (untested) for the input data structure -->

 <complexType name="Format_VS_t">
  <sequence>
    <element name="BLOCK" type="Block_t" minOccurs="0"
             maxOccurs="unbounded"/>
  </sequence>
 </complexType>

 <complexType name="Block_t">
  <sequence>
    <element name="SEGMENT" type="Segment_t" minOccurs="1"
             maxOccurs="2"/>
  </sequence>
 </complexType>

 <complexType name="Segment_t">
  <sequence>
   <choice>
     <element name="WHOLE">
     </element>
     <element name="FIRST">
     </element>
     <element name="LAST">
     </element>
     <element name="MIDDLE">
     </element>
   </choice>
   <element name="DATA" type="string"/>
  </sequence>
 </complexType>
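
For comparison, here is a hedged Python sketch of the same XML to XML
transformation done procedurally, using the standard library's
xml.etree.ElementTree. It is not a DFDL answer to the question, only a
baseline for what a declarative description would have to express. The
wrapper elements FORMAT_VS and ITEMS and the function names are assumptions
added so that the fragments form well-formed documents; the code also assumes
the input is well formed and does no diagnosis of bad tag sequences.

```python
import xml.etree.ElementTree as ET

# The sample input from above, wrapped in a hypothetical root element.
SAMPLE = """<FORMAT_VS>
  <BLOCK><SEGMENT><WHOLE/><DATA>The first item</DATA></SEGMENT></BLOCK>
  <BLOCK><SEGMENT><FIRST/><DATA>Thi</DATA></SEGMENT></BLOCK>
  <BLOCK><SEGMENT><MIDDLE/><DATA>s is t</DATA></SEGMENT></BLOCK>
  <BLOCK><SEGMENT><MIDDLE/><DATA>he sec</DATA></SEGMENT></BLOCK>
  <BLOCK>
    <SEGMENT><LAST/><DATA>ond item</DATA></SEGMENT>
    <SEGMENT><WHOLE/><DATA>Third item</DATA></SEGMENT>
  </BLOCK>
</FORMAT_VS>"""

TYPE_TAGS = ("WHOLE", "FIRST", "MIDDLE", "LAST")


def segment_layer(root):
    """Intermediate layer: flatten every BLOCK into one list of (tag, data)."""
    pairs = []
    for segment in root.iter("SEGMENT"):  # document order, across blocks
        tag = next(child.tag for child in segment if child.tag in TYPE_TAGS)
        pairs.append((tag, segment.findtext("DATA", default="")))
    return pairs


def item_layer(pairs):
    """Output layer: an item is complete at every WHOLE or LAST segment."""
    items = ET.Element("ITEMS")
    pieces = []
    for tag, data in pairs:
        pieces.append(data)
        if tag in ("WHOLE", "LAST"):
            ET.SubElement(items, "ITEM").text = "".join(pieces)
            pieces = []
    return items


def transform(xml_text):
    """Parse the block/segment form and return the logical form as text."""
    root = ET.fromstring(xml_text)
    return ET.tostring(item_layer(segment_layer(root)), encoding="unicode")
```

Calling transform(SAMPLE) returns the three ITEM elements inside the assumed
ITEMS wrapper. The block boundaries disappear in segment_layer, which is
where the two tree shapes stop lining up.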