Fw: [dfdl-wg] Transform examples

Steve Hanson smh at uk.ibm.com
Wed Nov 24 05:03:33 CST 2004





I will second Mike's points here. As discussed on one of our calls, DFDL
'transforms' are equivalent to a set of IBM model properties. The key
information needed to parse a string correctly and efficiently is, in our
model:

- Physical type (fixed length, length prefixed, null terminated, etc.; same
  as Mike's length protocols)
- Length count (if fixed length)
- Length reference (name of another field in the message giving the length;
  an alternative to length count)
- CCSID
- Length units (handles the character set issue Mike noted below, in
  conjunction with the CCSID)
- Justification
- Padding character (these last two are used because we auto-strip/add
  padding chars on read/write)
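
As a rough illustration of how these properties could work together, here is a minimal Python sketch. The class and function names are hypothetical and are not taken from the IBM model; a Python codec name stands in for the CCSID, and only the fixed-length, byte-counted case is handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StringFieldProps:
    """Hypothetical container for the string properties listed above."""
    physical_type: str                      # "fixed", "length_prefixed", "null_terminated"
    length_count: Optional[int] = None      # used when physical_type == "fixed"
    length_reference: Optional[str] = None  # name of the field holding the length
    ccsid: str = "ascii"                    # assumption: a Python codec name stands in for a CCSID
    length_units: str = "bytes"             # "bytes" or "characters"
    justification: str = "left"             # "left" or "right"
    padding_char: str = " "


def read_fixed(data: bytes, props: StringFieldProps) -> str:
    """Decode a fixed-length, byte-counted string field and strip its padding."""
    if (props.physical_type != "fixed"
            or props.length_units != "bytes"
            or props.length_count is None):
        raise ValueError("this sketch only handles fixed-length fields counted in bytes")
    text = data[:props.length_count].decode(props.ccsid)
    # Left-justified values are padded on the right, and vice versa.
    if props.justification == "left":
        return text.rstrip(props.padding_char)
    return text.lstrip(props.padding_char)
```

For example, a left-justified 6-byte field containing `abc` followed by three spaces would come back as `abc`.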

Regards, Steve


----- Forwarded by Steve Hanson/UK/IBM on 24/11/2004 10:52 -----
                                                                           
From: mike.beckerle at ascentialsoftware.com
Sent by: owner-dfdl-wg at ggf.org
Sent: 22/11/2004 17:26
To: jim.myers at pnl.gov, dfdl-wg at gridforum.org
Subject: RE: [dfdl-wg] Transform examples





I'm just trying to avoid having DFDL's design preclude the creation of an
efficient implementation. It's easy to do this if we're not careful. E.g.,
if we ignore length issues then I can guarantee we won't be able to meet
our previously-stated goals of making efficient random access possible.

I can speak from experience here in that I have a mature implementation of
a system like DFDL, just not an industry standard one. In the
implementation one of the most critical things is the ability to rapidly
get to the ends of records, or fields, without having to process the
intervening data. Much effort in the implementation is devoted to these
kinds of length machinations, and we use a plug-in architecture where each
data type and all of its converters can be extended and plugged in anew.
For example we didn't originally have any date/time or dateTime types, but
added them afterwards. All types and converters obey a number of different
protocols associated with how length works. I can't see any other way to do
it.

That said, we want the minimum complexity in DFDL at the interface to the
transformations that enables efficient implementation.

...mikeb

 From: Myers, James D [mailto:jim.myers at pnl.gov]
 Sent: Monday, November 22, 2004 11:51 AM
 To: dfdl-wg at gridforum.org
 Subject: RE: [dfdl-wg] Transform examples

 Just a high level comment (not sure I'd split things up the same way you
 have but I won't comment on that now ...): I'm not averse to putting
 something in DFDL about the expectations about length of fields, but is
 this useful in practice versus simply coding this in the parser you build?
 The parser is really the thing that will use it to calculate offsets, so
 some methods related to getting offsets on the transform classes are
 really what's needed - is that made simpler if there's info in the DFDL
 transform description? (And can I safely ignore this info in a dumb
 reference implementation that can only calculate offsets by actually
 parsing and then counting?)

   Jim

 -----Original Message-----
 From: owner-dfdl-wg at ggf.org [mailto:owner-dfdl-wg at ggf.org] On Behalf Of
 mike.beckerle at ascentialsoftware.com
 Sent: Monday, November 22, 2004 11:01 AM
 To: Chappell, Alan R; dfdl-wg at gridforum.org
 Subject: RE: [dfdl-wg] Transform examples


 Alan,

 I looked at these examples.

 There's one thing I think you've overlooked in the way transforms are
 specified here. This is the fact that intFromBinary knows that it will
 pick exactly 4 bytes off the input stream, and could advertise that
 property to the DFDL "system" in some way, whereas intFromAscii might take
 anything from 1 to however many characters. E.g., it might be able to
 tolerate whitespace of any size, leading zeros, etc. So as a transform it
 needs to advertise that the length of data being consumed requires that
 you run the transform.

 Where I'm coming from is this. It is very important that a DFDL
 description of data enable processing the data efficiently. To me that
 means that if data is all fixed width, then one should be able to randomly
 access fields in the data in constant time. Even if the data is variable
 width, one should be able to efficiently skip through it to find the
 boundaries without necessarily having to process all the data, convert to
 common format, etc.

 To achieve this, transformations must support determining length and
 determining value separately when possible.
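
To illustrate the constant-time case, here is a small Python sketch, assuming a hypothetical record layout of three fixed-width fields. The widths and names are made up for the example. The point is only that a field's position follows from arithmetic on the widths, without touching the intervening data.

```python
from itertools import accumulate

# Hypothetical layout: three fields of 4, 8 and 2 bytes per record.
WIDTHS = [4, 8, 2]
# Offset of each field within a record: [0, 4, 12].
OFFSETS = [0] + list(accumulate(WIDTHS))[:-1]
RECORD_SIZE = sum(WIDTHS)


def field_slice(buf: bytes, record_index: int, field_index: int) -> bytes:
    """Return the raw bytes of one field using arithmetic only (no parsing)."""
    start = record_index * RECORD_SIZE + OFFSETS[field_index]
    return buf[start:start + WIDTHS[field_index]]
```

Converting the returned bytes to a value is a separate step, which is the separation of length and value described above.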

 There are these things I call "length protocols":

 1) FIXED_LENGTH: the length is static in the meaning of the type.  E.g., 4
 byte length is implicit in the type "int"

 2) STATIC_LENGTH: the length is static as part of the element definition.
 E.g., a 12-digit packed decimal known from the COBOL FD, or a string with
 exactly 12 characters. (Note that we ignore the implications of
 variable-width character encodings like UTF-8 here on purpose; more on
 that below.)

 3) OUTSIDE_LENGTH:  the length is dynamic, and comes from elsewhere. I.e.,
 consider a stored length prefix field.  We probably don't have to touch
 the data to skip past it, for example, though we did have to read the
 length field someplace to know how far to skip.

 4) PARSE_LENGTH: the length is dynamic, and computing the length of the
 element is as hard as computing the value, so you might as well do them
 both simultaneously (e.g., delimited text situation)
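
The four protocols could be sketched roughly as follows in Python. This is only an illustration under assumptions: the function names, the 2-byte big-endian length prefix, and the single-byte delimiter are all hypothetical choices, not anything the examples or DFDL define.

```python
import struct
from enum import Enum


class LengthProtocol(Enum):
    FIXED_LENGTH = 1    # length implied by the type itself
    STATIC_LENGTH = 2   # length given in the element definition
    OUTSIDE_LENGTH = 3  # length obtained at run time from elsewhere
    PARSE_LENGTH = 4    # length only known by scanning the data


def end_of_field(buf, pos, protocol, length=None, delimiter=b","):
    """Return the position just past a field that starts at pos."""
    if protocol is LengthProtocol.PARSE_LENGTH:
        # Must look at the data: scan forward for the delimiter.
        end = buf.find(delimiter, pos)
        return len(buf) if end == -1 else end
    if length is None:
        raise ValueError("protocols 1-3 need a length supplied by the caller")
    # Protocols 1-3: skip without touching the field's data.
    return pos + length


def skip_length_prefixed(buf, pos):
    """OUTSIDE_LENGTH example: assumes a 2-byte big-endian length prefix."""
    (n,) = struct.unpack_from(">H", buf, pos)
    return end_of_field(buf, pos + 2, LengthProtocol.OUTSIDE_LENGTH, length=n)
```

Only the PARSE_LENGTH branch reads the field's own bytes; the other three just add a known length to the current position.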

 Now the character set issue. If the character set is fixed width, like
 ASCII, EBCDIC, or UTF-16 (ignoring surrogate pairs), then the above apply
 as defined. If the data format is text and the character set is variable
 width, like UTF-8 or Shift-JIS, then 1, 2, and 3 all collapse into 4.
 I.e., all lengths require
 you to parse the characters one by one. However, I'd like this detail to
 be pushed down into the DFDL implementation because there are different
 ways to do it. E.g., you could do like Java and convert everything to
 UTF-16 first and eliminate the whole issue, or you can try to be more
 clever.
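
To show why a character-counted length forces a scan in a variable-width encoding, here is a hedged Python sketch for UTF-8. The function name is hypothetical, and the code only inspects lead bytes; it does not validate continuation bytes.

```python
def utf8_byte_length(buf: bytes, pos: int, n_chars: int) -> int:
    """Return how many bytes n_chars UTF-8 characters occupy, starting at pos.

    Each character's width is only known from its lead byte, so the
    characters have to be walked one by one.
    """
    i = pos
    for _ in range(n_chars):
        if i >= len(buf):
            raise ValueError("ran out of data")
        lead = buf[i]
        if lead < 0x80:              # 0xxxxxxx: 1 byte
            i += 1
        elif lead >> 5 == 0b110:     # 110xxxxx: 2 bytes
            i += 2
        elif lead >> 4 == 0b1110:    # 1110xxxx: 3 bytes
            i += 3
        elif lead >> 3 == 0b11110:   # 11110xxx: 4 bytes
            i += 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
    if i > len(buf):
        raise ValueError("truncated UTF-8 sequence")
    return i - pos
```

With a single-byte character set the same answer would simply be `n_chars`, with no scan at all.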

 I think transforms must advertise the protocols they support. E.g.,

 intFromBinary in your example supports only FIXED_LENGTH protocol, and it
 should say the length is exactly 4.

 intFromAscii should support protocols 2, 3, and 4. Only protocol 4
 supports delimiters and their attendant complexities like how embedded
 delimiters might be quoted or escaped. This "transform" function must
 compute both an integer value, and also compute the length of consumed
 data in the underlying stream, or by-side-effect advance the stream to the
 new position.  The point is not to take a position on whether we manage
 lengths, or have a stateful cursor on the stream, the point is that there
 are 3 functions to provide. One is parameterized by a static length, one
 is parameterized by a dynamic length, and the third is parameterized by
 delimiters, escape sequence specifications, etc. All share the numbase
 parameter.
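
A rough Python sketch of what this could look like follows. Everything here is an assumption made for illustration: the function names, the registry of advertised protocols, the big-endian byte order for intFromBinary, and the choice to return the new position rather than advance a stateful cursor. Quoting and escaping of delimiters is left out.

```python
import struct

# Hypothetical registry of the length protocols each transform advertises.
ADVERTISED = {
    "intFromBinary": {"FIXED_LENGTH"},
    "intFromAscii": {"STATIC_LENGTH", "OUTSIDE_LENGTH", "PARSE_LENGTH"},
}


def int_from_binary(buf, pos):
    """FIXED_LENGTH: always consumes exactly 4 bytes (big-endian assumed)."""
    (value,) = struct.unpack_from(">i", buf, pos)
    return value, pos + 4


def int_from_ascii_static(buf, pos, length, numbase=10):
    """STATIC_LENGTH: length is known from the element definition."""
    text = buf[pos:pos + length].decode("ascii")
    # Tolerate surrounding whitespace and leading zeros.
    return int(text.strip(), numbase), pos + length


def int_from_ascii_dynamic(buf, pos, length, numbase=10):
    """OUTSIDE_LENGTH: same conversion, but length was read at run time."""
    return int_from_ascii_static(buf, pos, length, numbase)


def int_from_ascii_delimited(buf, pos, delimiter=b",", numbase=10):
    """PARSE_LENGTH: the end is found by scanning (no escape handling here)."""
    end = buf.find(delimiter, pos)
    if end == -1:
        end = len(buf)
    value = int(buf[pos:end].decode("ascii").strip(), numbase)
    return value, end
```

Each function returns both the value and the position after the consumed data, so a caller that only needs to skip can ignore the value.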

 This all adds baggage, but I think it is necessary or things just can't be
 efficient.

 ...mikeb


  From: Chappell, Alan R [mailto:chappella at BATTELLE.ORG]
  Sent: Friday, November 19, 2004 4:44 PM
  To: dfdl-wg at gridforum.org
  Subject: [dfdl-wg] Transform examples

  Third try... No zip, just the 3 files important to the simple transform
  example....

  From: Chappell, Alan R
  Sent: Friday, November 19, 2004 1:39 PM
  To: dfdl-wg at gridforum.org
  Subject: *MJ-REJECTED* Transform examples

  Second try on sending these examples. I've cut the set down to the 3
  important files so hopefully it will get through this time.

   From: Chappell, Alan R
  Sent: Thursday, November 18, 2004 8:47 AM
  To: dfdl-wg at gridforum.org
  Subject: *MJ-REJECTED* Transform examples


  Here is the example I mentioned yesterday. Look particularly at
  dfdltransforms.xsd, BasicAsciiIntExp.xsd, and BasicBinIntExp.xsd. Note
  the "Exp" on those last two files indicate that they are expansions of
  the information in the original versions of those files. These make a
  first stab at giving a fully verbose description of the structure and the
  transforms, i.e., it’s working towards the canonical representation
  Martin talked about yesterday. The "dfdltransforms" gives the definitions
  of transforms and their components.


  There are lots of things that can be improved here.
  <<dfdl-examples.zip>>
  Alan R. Chappell
  chappella at battelle.org


  Pacific Northwest National Laboratory
  Battelle Seattle Research Center
  (206) 528-3228


