[dfdl-wg] Transform examples

Mon Nov 22 11:26:27 CST 2004

I'm just trying to avoid having DFDL's design preclude the creation of an
efficient implementation. It's easy to do this if we're not careful. E.g.,
if we ignore length issues then I can guarantee we won't be able to meet our
previously-stated goals of making efficient random access possible.

I can speak from experience here in that I have a mature implementation of a
system like DFDL, just not an industry standard one. In the implementation
one of the most critical things is the ability to rapidly get to the ends of
records, or fields, without having to process the intervening data. Much
effort in the implementation is devoted to these kinds of length
machinations, and we use a plug-in architecture where each data type and all
converters on them can be extended and plugged in anew. For example we
didn't originally have any date/time or dateTime types, but added them
afterwards. All types and converters obey a number of different protocols
associated with how length works. I can't see any other way to do it.

That said, we want the minimum complexity in DFDL at the interface to the
transformations that enables efficient implementation.

...mikeb

  _____  

From: Myers, James D [mailto:jim.myers at pnl.gov] 
Sent: Monday, November 22, 2004 11:51 AM
To: dfdl-wg at gridforum.org
Subject: RE: [dfdl-wg] Transform examples

Just a high level comment (not sure I'd split things up the same way you
have but I won't comment on that now ...): I'm not averse to putting
something in DFDL about the expectations about length of fields, but is this
useful in practice versus simply coding this in the parser you build? The
parser is really the thing that will use it to calculate offsets, so some
methods related to getting offsets on the transform classes are really
what's needed - is that made simpler if there's info in the DFDL transform
description? (And can I safely ignore this info in a dumb reference
implementation that can only calculate offsets by actually parsing and then
counting?)

  Jim

-----Original Message-----
From: owner-dfdl-wg at ggf.org [mailto:owner-dfdl-wg at ggf.org] On Behalf Of
mike.beckerle at ascentialsoftware.com
Sent: Monday, November 22, 2004 11:01 AM
To: Chappell, Alan R; dfdl-wg at gridforum.org
Subject: RE: [dfdl-wg] Transform examples

Alan,

I looked at these examples.

There's one thing I think you've overlooked in the way transforms are
specified here. This is the fact that intFromBinary knows that it will pick
exactly 4 bytes off the input stream, and could advertise that property to
the DFDL "system" in some way, whereas intFromAscii might take anything from
1 to however many characters. E.g., it might be able to tolerate whitespace
of any size, leading zeros, etc. So as a transform it needs to advertise
that the length of the data it consumes can only be found by running the
transform.

Where I'm coming from is this. It is very important that a DFDL description
of data enable processing the data efficiently. To me that means that if
data is all fixed width, then one should be able to randomly access fields
in the data in constant time. Even if the data is variable width, one should
be able to efficiently skip through it to find the boundaries without
necessarily having to process all the data, convert to common format, etc.

To achieve this, transformations must support determining length and
determining value separately when possible.
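
As a rough illustration of the fixed-width case, here is a minimal Python
sketch. The record layout, the field names, and the helpers field_offsets and
read_field are all assumptions made up for this example; nothing here comes
from the DFDL drafts. It only shows that when every width is known up front,
the offset of any field is plain arithmetic and no intervening data is read.

```python
import struct

# Hypothetical fixed-width record: 4-byte int, 8-byte float, 12-byte string.
# Layout and names are assumptions for illustration only.
FIELDS = [("id", "i"), ("price", "d"), ("name", "12s")]
BYTE_ORDER = ">"  # big-endian, no alignment padding


def field_offsets(fields):
    """Compute each field's offset inside a record from widths alone."""
    offsets, pos = {}, 0
    for name, code in fields:
        offsets[name] = (pos, code)
        pos += struct.calcsize(BYTE_ORDER + code)
    return offsets, pos


OFFSETS, RECORD_SIZE = field_offsets(FIELDS)


def read_field(buf, record_index, name):
    """Jump straight to one field of one record; nothing in between is parsed."""
    start, code = OFFSETS[name]
    position = record_index * RECORD_SIZE + start
    return struct.unpack_from(BYTE_ORDER + code, buf, position)[0]
```

The cost of read_field does not depend on record_index, which is the
constant-time property described above.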

There are these things I call "length protocols":

1) FIXED_LENGTH: the length is static in the meaning of the type.  E.g., a
4-byte length is implicit in the type "int".

2) STATIC_LENGTH: the length is static as part of the element definition.
E.g., a 12-digit packed decimal known from the COBOL FD, or a string with
exactly 12 characters. (Note that we ignore the implications of variable-width
character encodings like UTF-8 here on purpose; more on that below.)

3) OUTSIDE_LENGTH:  the length is dynamic, and comes from elsewhere. I.e.,
consider a stored length prefix field.  We probably don't have to touch the
data to skip past it, for example, though we did have to read the length
field someplace to know how far to skip.

4) PARSE_LENGTH: the length is dynamic, and computing the length of the
element is as hard as computing the value, so you might as well do them both
simultaneously (e.g., delimited text situation)
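
One way to picture the four protocols is as the cases of a "skip this field"
routine. The sketch below is only an illustration under stated assumptions:
the LengthProtocol enum, the skip_field helper, the 1-byte length prefix, and
the comma delimiter are all hypothetical choices, and quoting and escaping of
delimiters are left out entirely.

```python
from enum import Enum


class LengthProtocol(Enum):
    FIXED_LENGTH = 1    # length implied by the type itself
    STATIC_LENGTH = 2   # length given in the element definition
    OUTSIDE_LENGTH = 3  # length stored elsewhere, e.g. a prefix field
    PARSE_LENGTH = 4    # length known only by parsing the data


def skip_field(buf, pos, protocol, static_len=None, delimiter=b","):
    """Return the position just past the field that starts at pos."""
    if protocol in (LengthProtocol.FIXED_LENGTH, LengthProtocol.STATIC_LENGTH):
        return pos + static_len          # pure arithmetic, no data touched
    if protocol is LengthProtocol.OUTSIDE_LENGTH:
        n = buf[pos]                     # assumed 1-byte length prefix
        return pos + 1 + n               # payload itself is skipped unread
    # PARSE_LENGTH: every byte must be examined to find the end.
    end = buf.find(delimiter, pos)
    return len(buf) if end == -1 else end + 1
```

Only the last branch has a cost proportional to the size of the field; the
first two read nothing and the third reads just the prefix.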

Now the character set issue. If the character set is fixed width, like
ASCII, EBCDIC, or UTF-16 (setting aside surrogate pairs), then the above apply
as defined. If the data format is text and the character set is variable
width, like UTF-8 or Shift-JIS, then 1, 2, and 3 all collapse into 4. I.e.,
all lengths require you to parse the characters one by one. However, I'd like
this detail to be pushed down into the DFDL implementation because there are
different ways to do it. E.g., you could do like Java and convert everything
to UTF-16 first and eliminate the whole issue, or you can try to be more
clever.
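
To make the collapse concrete, the following Python sketch contrasts the two
situations for a string declared as "n characters". Both function names are
hypothetical, and the UTF-8 walker assumes well-formed input (it inspects
lead bytes only and does not validate continuation bytes).

```python
def byte_length_fixed(n_chars, bytes_per_char):
    """Fixed-width encoding: the byte length is a single multiplication."""
    return n_chars * bytes_per_char


def byte_length_utf8(buf, pos, n_chars):
    """Variable-width encoding: walk lead bytes to find where n_chars end."""
    end = pos
    for _ in range(n_chars):
        lead = buf[end]
        if lead < 0x80:              # 0xxxxxxx: 1-byte character
            end += 1
        elif lead >> 5 == 0b110:     # 110xxxxx: 2-byte character
            end += 2
        elif lead >> 4 == 0b1110:    # 1110xxxx: 3-byte character
            end += 3
        elif lead >> 3 == 0b11110:   # 11110xxx: 4-byte character
            end += 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
    return end - pos
```

Even with a static character count, the UTF-8 version has to touch the data
once per character, which is why the first three protocols degrade to the
fourth for such encodings.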

I think transforms must advertise the protocols they support. E.g., 

intFromBinary in your example supports only FIXED_LENGTH protocol, and it
should say the length is exactly 4.

intFromAscii should support protocols 2, 3, and 4. Only protocol 4 supports
delimiters and their attendant complexities like how embedded delimiters
might be quoted or escaped. This "transform" function must compute both an
integer value, and also compute the length of consumed data in the
underlying stream, or by-side-effect advance the stream to the new position.
The point is not to take a position on whether we manage lengths or have a
stateful cursor on the stream; the point is that there are 3 functions to
provide. One is parameterized by a static length, one is parameterized by a
dynamic length, and the third is parameterized by delimiters, escape
sequence specifications, etc. All share the numbase parameter.
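
The three entry points for intFromAscii might look roughly like the Python
sketch below. All function names and signatures are assumptions made for
illustration, not anything defined in dfdltransforms.xsd. Each returns a
(value, consumed_length) pair rather than advancing a cursor, which is just
one of the two options mentioned above, and the delimited variant ignores
quoting and escaping.

```python
def make_int_from_ascii_static(length, numbase=10):
    """Protocol 2: length is bound once, when the element is defined."""
    def transform(buf, pos):
        return int(buf[pos:pos + length].decode("ascii"), numbase), length
    transform.length = length  # advertised, so callers can skip without converting
    return transform


def int_from_ascii_dynamic(buf, pos, length, numbase=10):
    """Protocol 3: length is supplied per call, e.g. from a prefix field."""
    return int(buf[pos:pos + length].decode("ascii"), numbase), length


def int_from_ascii_delimited(buf, pos, delimiter=b",", numbase=10):
    """Protocol 4: value and length are discovered together by scanning."""
    end = buf.find(delimiter, pos)
    if end == -1:
        end = len(buf)
    # consumed length excludes the delimiter itself
    return int(buf[pos:end].decode("ascii"), numbase), end - pos
```

With the first two, a caller that only wants the field boundary never needs
to call the conversion at all; with the third, finding the boundary and
finding the value are the same scan.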

This all adds baggage, but I think it is necessary or things just can't be
efficient.

...mikeb

  _____  

From: Chappell, Alan R [mailto:chappella at BATTELLE.ORG] 
Sent: Friday, November 19, 2004 4:44 PM
To: dfdl-wg at gridforum.org
Subject: [dfdl-wg] Transform examples

Third try... No zip, just the 3 files important to the simple transform
example....

  _____  

From: Chappell, Alan R 
Sent: Friday, November 19, 2004 1:39 PM
To: dfdl-wg at gridforum.org
Subject: *MJ-REJECTED* Transform examples

Second try on sending these examples. I've cut the set down to the 3
important files so hopefully it will get through this time.

  _____  

From: Chappell, Alan R 
Sent: Thursday, November 18, 2004 8:47 AM
To: dfdl-wg at gridforum.org
Subject: *MJ-REJECTED* Transform examples

Here is the example I mentioned yesterday. Look particularly at
dfdltransforms.xsd, BasicAsciiIntExp.xsd, and BasicBinIntExp.xsd. Note the
"Exp" on those last two files indicate that they are expansions of the
information in the original versions of those files. These make a first stab
at giving a fully verbose description of the structure and the transforms,
i.e., it's working towards the canonical representation Martin talked about
yesterday. The "dfdltransforms" gives the definitions of transforms and
their components.

There are lots of things that can be improved here. 
<<dfdl-examples.zip>> 
Alan R. Chappell 
chappella at battelle.org 

Pacific Northwest National Laboratory 
Battelle Seattle Research Center 
(206) 528-3228 
