[dfdl-wg] CSV string worked example

Westhead, Martin (Martin) westhead at avaya.com
Wed Mar 1 08:41:34 CST 2006


Hi Jim,

 

I think the others should be implicitly used because of order and type.

 

Sorry, tokenizer should be split - an unpropagated change.

 

Chartostring (which I called concatenate) is to be used first (because
it is the more specific match).

 

EOS is up for grabs. I was thinking of it as a returned value (e.g. -1),
but an exception might (or might not) be easier to make sense of.
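
To make the two options concrete, here is a rough Python sketch. The
names EOS, EndOfStream, read_sentinel and read_raising are made up for
illustration and are not taken from any DFDL draft; the stream is
assumed to be a plain bytes object.

```python
EOS = -1  # hypothetical sentinel value, as in the "-1" suggestion


class EndOfStream(Exception):
    """Hypothetical exception signalling that the stream is exhausted."""


def read_sentinel(stream: bytes, pos: int) -> int:
    # Option 1: EOS is a returned value of the same type as the data.
    return stream[pos] if pos < len(stream) else EOS


def read_raising(stream: bytes, pos: int) -> int:
    # Option 2: EOS is reported out of band as an exception.
    if pos >= len(stream):
        raise EndOfStream(pos)
    return stream[pos]
```

With the returned-value form, the sentinel has to be a value of the
returned type, so -1 is only safe where -1 can never be real data (true
for bytes, not for a stream of ints). The exception form avoids that
clash at the cost of a different control flow.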

 

Regarding the new model: I don't think this is a problem at the level of
your example. We could simply use a single sequence and a more complex
"split" conversion. I imagine that the "split" conversion we would want
to settle on should accept a regular expression (or at least a list of
separators). In your example you just have to allow the separator to be
a newline OR a comma and you are done.
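
As an illustration only, a "split" that takes a list of separators could
look like the Python sketch below. The function name split and its
signature are assumptions for this example, not a proposed DFDL
definition; the sample data is invented (two rows of three ints rather
than ten rows of five).

```python
import re


def split(text, separators):
    """Split text on any of the given literal separator strings."""
    # Escape each separator so it is matched literally, then join the
    # alternatives into one regular expression: "," or "\n".
    pattern = "|".join(re.escape(s) for s in separators)
    return re.split(pattern, text)


# Physically two rows of three values; logically one flat sequence.
data = "1,2,3\n4,5,6"
tokens = split(data, [",", "\n"])
values = [int(t) for t in tokens]
print(values)  # [1, 2, 3, 4, 5, 6]
```

Because newline and comma are treated as equivalent separators, the row
structure simply disappears and a single sequence annotation is enough.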

 

A note here: this is intended as a rough sketch, not a finished design. I
expect the details will need to be worked out. In particular, I
think Mike/IBM have some fairly complex ideas for
separator/terminator/initiator/escape that we will have to try to fit
into this framework.

 

Thanks,

 

Martin

 

  _____  

From: Jim Myers [mailto:jimmyers at ncsa.uiuc.edu] 
Sent: Wednesday, March 01, 2006 3:49 AM
To: Westhead, Martin (Martin); dfdl-wg at ggf.org
Subject: Re: [dfdl-wg] CSV string worked example

 

Martin - two types of comments - things I think are
typos/inconsistencies and an alternate logic:

Clarifications:
are the initial definitions on the top element defining an order to use
subsequently or are they just there for us to see what you've defined?
Of the four there, you only explicitly (in a comment?) invoke one - are
the others implicit because of the order?
You use dfdl:tokenizer as a conversion later - is that supposed to be
split as well?
bytetochar is used implicitly before the first split?
chartostring is used implicitly before stringtoint which is implicitly
used to get the int element?
is EOS a returned value (and therefore of the type being returned) or is
it an exception?

Logical - what happens if the rows are not in the logical model?
Physically there are 10 rows of 5 elements, but the logical model is
50 ints in a single sequence. To support this, you'd need to have both
tokenization steps in one sequence annotation with two separate split
separators - does the use of setLocal for split separator work in this
case? (Is this how byteorder is now used?)
Thinking about missing values - is it clear how a missing row versus a
missing element is now handled? I think so: the split conversion using
comma can define a default input to use if the stream it receives is
empty (from a \n\n pair), and the stringtoint conversion can do likewise
to cover a ,, pair.
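
A rough Python sketch of that idea, assuming each conversion takes an
optional default (the function names, the default parameters and the
sample data are all invented for illustration):

```python
def split(text, separator, default=""):
    """Split on separator; an empty input is replaced by default first."""
    if text == "":
        text = default
    return text.split(separator)


def string_to_int(token, default=0):
    """Convert a token to int; an empty token yields default."""
    return default if token == "" else int(token)


# Second row is missing (\n\n pair); third row has a missing element (,,).
data = "1,2,3\n\n4,,6"
rows = [
    [string_to_int(t) for t in split(row, ",", default="0,0,0")]
    for row in split(data, "\n")
]
print(rows)  # [[1, 2, 3], [0, 0, 0], [4, 0, 6]]
```

The missing row is caught by the comma split (empty input), the missing
element by stringtoint (empty token), so the two cases stay distinct.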

  Jim


At 09:25 PM 2/28/2006, Westhead, Martin (Martin) wrote:



Hi Folks,
 
I have tried to work through the CSV example that Mike suggested a
couple of weeks ago. It has turned up some interesting issues which I
have tried to address. These are less about making the underlying
semantics work and more about providing a seamless default set up that
makes the easy things work just as you would like.
 
I was pushed for time on this, so I apologise if this is unclear in
places, but I wanted to put it out before tomorrow's meeting.
 
Thanks,
 
Martin

James D. Myers
Associate Director, Cyberenvironments and Technologies, NCSA
1205 W. Clark St, MC-257
Urbana, IL 61801
217-244-1934
jimmyers at ncsa.uiuc.edu


