[dfdl-wg] Transform examples

Myers, James D jim.myers at pnl.gov
Mon Nov 22 10:51:02 CST 2004


Just a high level comment (not sure I'd split things up the same way you
have but I won't comment on that now ...): I'm not averse to putting
something in DFDL about the expectations about length of fields, but is
this useful in practice versus simply coding this in the parser you
build? The parser is really the thing that will use it to calculate
offsets, so some methods related to getting offsets on the transform
classes are really what's needed - is that made simpler if there's info
in the DFDL transform description? (And can I safely ignore this info in
a dumb reference implementation that can only calculate offsets by
actually parsing and then counting?)
 
  Jim
 
-----Original Message-----
From: owner-dfdl-wg at ggf.org [mailto:owner-dfdl-wg at ggf.org] On Behalf Of
mike.beckerle at ascentialsoftware.com
Sent: Monday, November 22, 2004 11:01 AM
To: Chappell, Alan R; dfdl-wg at gridforum.org
Subject: RE: [dfdl-wg] Transform examples



	 
	Alan,
	 
	I looked at these examples.
	 
	There's one thing I think you've overlooked in the way
transforms are specified here. This is the fact that intFromBinary knows
that it will pick exactly 4 bytes off the input stream, and could
advertise that property to the DFDL "system" in some way, whereas
intFromAscii might take anything from 1 to however many characters.
E.g., it might be able to tolerate whitespace of any size, leading
zeros, etc. So as a transform it needs to advertise that the length of
the data it consumes can only be determined by running the transform.
	 
	Where I'm coming from is this. It is very important that a DFDL
description of data enable processing the data efficiently. To me that
means that if data is all fixed width, then one should be able to
randomly access fields in the data in constant time. Even if the data is
variable width, one should be able to efficiently skip through it to
find the boundaries without necessarily having to process all the data,
convert to common format, etc.
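
A minimal sketch of the constant-time access described above, assuming a hypothetical record layout of two 4-byte big-endian ints followed by one 2-byte int with no padding. The layout, the names RECORD, FIELD_OFFSETS, FIELD_FORMATS and read_field are illustrative assumptions and not part of any DFDL proposal.

```python
import struct

# Hypothetical fixed-width record: int32, int32, int16, big-endian, unpadded.
RECORD = struct.Struct(">iih")
FIELD_OFFSETS = (0, 4, 8)          # byte offset of each field in a record
FIELD_FORMATS = (">i", ">i", ">h")  # how to decode each field on its own


def read_field(data: bytes, record_index: int, field_index: int) -> int:
    """Decode one field without touching any other field or record."""
    # Because every width is known up front, the offset is pure arithmetic.
    offset = record_index * RECORD.size + FIELD_OFFSETS[field_index]
    return struct.unpack_from(FIELD_FORMATS[field_index], data, offset)[0]


data = RECORD.pack(1, 2, 3) + RECORD.pack(10, 20, 30)
value = read_field(data, 1, 2)  # third field of the second record
```

The offset computation never reads the data, which is what makes the access independent of how many records precede the one requested.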
	 
	To achieve this, transformations must support determining length
and determining value separately when possible.
	 
	There are these things I call "length protocols":
	 
	1) FIXED_LENGTH: the length is static in the meaning of the
type.  E.g., 4 byte length is implicit in the type "int"
	 
	2) STATIC_LENGTH: the length is static as part of the element
definition. E.g.,  12 digit packed decimal known from the Cobol FD. Or a
string with exactly 12 characters. (Note that we ignore the implications
of variable-width character encodings like UTF-8 here on purpose; more
on that below.)
	 
	3) OUTSIDE_LENGTH:  the length is dynamic, and comes from
elsewhere. I.e., consider a stored length prefix field.  We probably
don't have to touch the data to skip past it, for example, though we did
have to read the length field someplace to know how far to skip.
	 
	4) PARSE_LENGTH: the length is dynamic, and computing the length
of the element is as hard as computing the value, so you might as well
do them both simultaneously (e.g., delimited text situation)
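
The four protocols above can be sketched as an enumeration plus a function that finds the end of a field. This is only an illustration under assumptions: the names LengthProtocol and field_end are hypothetical, the caller is assumed to have already obtained the length for protocols 1 to 3, and the PARSE_LENGTH case uses a bare delimiter search with no quoting or escaping.

```python
from enum import Enum
from typing import Optional


class LengthProtocol(Enum):
    FIXED_LENGTH = 1    # length implied by the type, e.g. a 4-byte int
    STATIC_LENGTH = 2   # length fixed by the element definition
    OUTSIDE_LENGTH = 3  # length read from elsewhere, e.g. a prefix field
    PARSE_LENGTH = 4    # length only known by scanning the data


def field_end(data: bytes, start: int, protocol: LengthProtocol,
              length: Optional[int] = None, delimiter: bytes = b",") -> int:
    """Return the offset just past the field that starts at `start`."""
    if protocol is not LengthProtocol.PARSE_LENGTH:
        # Protocols 1-3: skip the field without looking at its bytes.
        if length is None:
            raise ValueError("this protocol needs a known length")
        return start + length
    # Protocol 4: the data itself must be scanned to find the boundary.
    pos = data.find(delimiter, start)
    return len(data) if pos == -1 else pos
```

Only the last branch reads the field's bytes; the first three differ in where the length comes from, not in how it is used to skip.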
	 
	Now the character set issue. If the character set is fixed
width, like ASCII, EBCDIC, or UTF-16 (surrogate pairs aside), then the
above apply as defined. If the data format is text and the character set
is variable width, like UTF-8 or Shift-JIS, then 1, 2, and 3 all collapse
into 4. I.e., all lengths require you to parse the characters one by one.
However, I'd like this detail to be pushed down into the DFDL
implementation because there are different ways to do it. E.g., you could
do as Java does and convert everything to UTF-16 first and eliminate the
whole issue, or you can try to be more clever.
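
To illustrate why a character-count length collapses into parsing for a variable-width encoding, here is a rough sketch comparing the two cases. The function names are hypothetical, and the UTF-8 version assumes well-formed input: it inspects lead bytes only and does not validate continuation bytes.

```python
def byte_offset_fixed(n_chars: int, width: int) -> int:
    """Fixed-width encoding: the offset is simple multiplication."""
    return n_chars * width


def byte_offset_utf8(data: bytes, n_chars: int) -> int:
    """UTF-8: walk the lead bytes to find where n_chars characters end."""
    pos = 0
    for _ in range(n_chars):
        if pos >= len(data):
            raise ValueError("fewer characters than requested")
        lead = data[pos]
        if lead < 0x80:            # 0xxxxxxx: 1-byte character
            pos += 1
        elif lead >> 5 == 0b110:   # 110xxxxx: 2-byte character
            pos += 2
        elif lead >> 4 == 0b1110:  # 1110xxxx: 3-byte character
            pos += 3
        elif lead >> 3 == 0b11110:  # 11110xxx: 4-byte character
            pos += 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
    return pos


# "a", e-acute, and the euro sign take 1, 2, and 3 bytes in UTF-8.
sample = "a\u00e9\u20ac".encode("utf-8")
```

For `sample`, a field of 2 characters ends at byte 3 and a field of 3 characters ends at byte 6, which cannot be known without examining the bytes.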
	 
	I think transforms must advertise the protocols they support.
E.g., 
	 
	intFromBinary in your example supports only FIXED_LENGTH
protocol, and it should say the length is exactly 4.
	 
	intFromAscii should support protocols 2, 3, and 4. Only protocol
4 supports delimiters and their attendant complexities like how embedded
delimiters might be quoted or escaped. This "transform" function must
compute both an integer value, and also compute the length of consumed
data in the underlying stream, or by-side-effect advance the stream to
the new position.  The point is not to take a position on whether we
manage lengths or have a stateful cursor on the stream; the point is
that there are 3 functions to provide. One is parameterized by a static
length, one is parameterized by a dynamic length, and the third is
parameterized by delimiters, escape sequence specifications, etc. All
share the numbase parameter.
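
One possible shape for those three functions, as a sketch only. All names here are hypothetical, each function returns the value together with the number of bytes consumed rather than advancing a cursor, and the delimited variant handles a single plain delimiter with no quoting or escape sequences.

```python
from typing import Callable, Tuple


def make_int_from_ascii_static(
        length: int, numbase: int = 10) -> Callable[[bytes, int], Tuple[int, int]]:
    """Static length: bound once, when the element definition is read."""
    def parse(data: bytes, pos: int) -> Tuple[int, int]:
        text = data[pos:pos + length].decode("ascii")
        return int(text.strip(), numbase), length
    return parse


def int_from_ascii_dynamic(data: bytes, pos: int, length: int,
                           numbase: int = 10) -> Tuple[int, int]:
    """Dynamic length: supplied per call, e.g. from a length prefix."""
    text = data[pos:pos + length].decode("ascii")
    return int(text.strip(), numbase), length


def int_from_ascii_delimited(data: bytes, pos: int, delimiter: bytes,
                             numbase: int = 10) -> Tuple[int, int]:
    """Delimited: the length is discovered while scanning for the value."""
    end = data.find(delimiter, pos)
    if end == -1:
        end = len(data)
    text = data[pos:end].decode("ascii")
    # The consumed count covers the field only, not the delimiter.
    return int(text.strip(), numbase), end - pos


parse4 = make_int_from_ascii_static(4)
```

With these definitions, `parse4(b"0042,7", 0)` gives `(42, 4)`, and `int_from_ascii_delimited(b"ff;10", 0, b";", numbase=16)` gives `(255, 2)`. Only the delimited variant has to read the field to learn its length.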
	 
	This all adds baggage, but I think it is necessary or things
just can't be efficient.
	 
	...mikeb
	 
	 

________________________________

		From: Chappell, Alan R [mailto:chappella at BATTELLE.ORG] 
		Sent: Friday, November 19, 2004 4:44 PM
		To: dfdl-wg at gridforum.org
		Subject: [dfdl-wg] Transform examples
		
		
		Third try... No zip, just the 3 files important to the
simple transform example....

________________________________

		From: Chappell, Alan R 
		Sent: Friday, November 19, 2004 1:39 PM
		To: dfdl-wg at gridforum.org
		Subject: *MJ-REJECTED* Transform examples
		
		
		Second try on sending these examples. I've cut the set
down to the 3 important files so hopefully it will get through this
time.

		
________________________________

		 From: Chappell, Alan R 
		Sent: Thursday, November 18, 2004 8:47 AM
		To: dfdl-wg at gridforum.org
		Subject: *MJ-REJECTED* Transform examples

		Here is the example I mentioned yesterday. Look
particularly at dfdltransforms.xsd, BasicAsciiIntExp.xsd, and
BasicBinIntExp.xsd. Note the "Exp" on those last two files indicate that
they are expansions of the information in the original versions of those
files. These make a first stab at giving a fully verbose description of
the structure and the transforms, i.e., it's working towards the
canonical representation Martin talked about yesterday. The
"dfdltransforms" gives the definitions of transforms and their
components.

		There are lots of things that can be improved here. 
		<<dfdl-examples.zip>> 
		Alan R. Chappell 
		chappella at battelle.org 

		Pacific Northwest National Laboratory 
		Battelle Seattle Research Center 
		(206) 528-3228 

