[DFDL-WG] hard case example - delimiter is whatever found after first field

Wed Aug 5 09:36:19 CDT 2009

We've been worrying about a supposed hard case that I think is not very
hard.

Consider a DFDL schema that is for a file like this:

field1;field2;field3
field1;fie./ld2;fi./eld3
field1.fi/;eld2.fi;/eld3
field1/fi.;eld2/fie;.ld3
field1;fi/.eld2;field./3

Each record contains three fields, all strings. The delimiter is one of ".",
";", or "/", depending on what is in the data.

The first field can be unambiguously parsed. It ends in one of ".", ";", or
"/" and cannot contain any of those three. The second and third fields are
separated by whatever was used to terminate the first field.

The subsequent fields need to use the actual delimiter that was found after
field one because they are allowed to contain the other two delimiters as
content, as illustrated in the example above where field2 and field3 are
broken up with those characters.

To handle this I suggest a schema something like this:

<element name="delim" type="string" dfdl:lengthKind="pattern"
   dfdl:lengthPattern="[.;/]"/>

<element name="record">
  <complexType>
  <sequence>
    <element name="f1" type="string" dfdl:lengthKind="pattern"
       dfdl:lengthPattern="[^.;/]*"/>
    <!-- notice the pattern excludes the possible delimiters -->
    <sequence>
      <annotation><appinfo>
         <dfdl:hidden ref="delim"/>
      </appinfo></annotation>
    </sequence>
    <sequence dfdl:separator="{ ../delim }" dfdl:terminator="\n">
      <element name="f2" type="string" dfdl:lengthKind="delimited"/>
      <element name="f3" type="string" dfdl:lengthKind="delimited"/>
    </sequence>
  </sequence>
  </complexType>
</element>

The above record uses a regexp to pick off the first field excluding all
possible delimiters.

Then a hidden field picks off the actual delimiter that is found.

Subsequently there is a sequence, whose separator is specified by
referencing the hidden field. This works exactly the way any computed
delimiter works. The "delim" field is, in essence, a header field specifying
the delimiter.
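
To make the intended parse behaviour concrete, here is a rough sketch in
Python of what a parser driven by the schema above would do. It is not DFDL
and not a proposed implementation; the function name parse_record and the
returned dict layout are made up for illustration.

```python
import re

# f1 is anything except the candidate delimiters; delim is exactly one of them.
# This mirrors the two lengthPattern values in the schema sketch.
HEAD = re.compile(r"([^.;/]*)([.;/])")


def parse_record(line):
    """Parse one record; parse_record is a hypothetical helper name."""
    line = line.rstrip("\n")  # the "\n" terminator of the sequence
    m = HEAD.match(line)
    if m is None:
        raise ValueError("no delimiter found after first field")
    f1, delim = m.group(1), m.group(2)
    # The rest of the record is separated only by the delimiter actually
    # found, so the other two characters are ordinary content here.
    parts = line[m.end():].split(delim)
    if len(parts) != 2:
        raise ValueError("expected exactly two fields after the first")
    return {"f1": f1, "delim": delim, "f2": parts[0], "f3": parts[1]}


if __name__ == "__main__":
    print(parse_record("field1.fi/;eld2.fi;/eld3\n"))
```

The point is only that the delimiter is captured once, as data, and then
reused as the separator for the remaining fields.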

The cost of this in complexity is that we have to specify the potential
set of delimiters in two regular expression patterns. For a case like this I
have no problem with this minor complexity.

I think this can be made to work for parsing. Some details (properties) are
missing of course, but the concept should be clear. For an obscure case like
this, I think this is much preferable to yet another keyword in DFDL.

For output, I think an output value calc would be needed to figure out the
value for the delim field. We would need functions in the expression library
to examine the strings in the infoset of field2 and field3 for the possible
delimiter characters so that on output we could figure out whether to use
".", ";", or "/" as the delim element's value. I don't know if our proposed
function library includes the necessary functions.
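
As a rough illustration of the logic such an output calculation would need,
here is a sketch in Python. It is not DFDL expression syntax; the names
choose_delimiter and unparse_record, and the first-fit order over ".;/", are
my own assumptions for the sake of the example.

```python
CANDIDATES = ".;/"  # the three possible delimiters, in an assumed preference order


def choose_delimiter(f2, f3, candidates=CANDIDATES):
    """Return the first candidate that appears in neither f2 nor f3."""
    for d in candidates:
        if d not in f2 and d not in f3:
            return d
    raise ValueError("every candidate delimiter occurs in the field content")


def unparse_record(f1, f2, f3):
    """Write one record; hypothetical helper, newline-terminated."""
    if any(d in f1 for d in CANDIDATES):
        raise ValueError("f1 must not contain any candidate delimiter")
    d = choose_delimiter(f2, f3)
    return f"{f1}{d}{f2}{d}{f3}\n"


if __name__ == "__main__":
    print(unparse_record("field1", "fi.;eld2", "fie;.ld3"), end="")
```

Note that the choice is not always unique. If field2 and field3 contain none
of the three characters, any delimiter would do, so the written data need not
match what was originally parsed unless the parsed delim value is kept.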

Do we need to concern ourselves with unparsing/writing out this kind of
format for DFDL v1.0, or is parsing enough?

...mike