[dfdl-wg] tagged data examples

mike.beckerle at ascentialsoftware.com
Thu Apr 7 09:26:49 CDT 2005


DFDLers,

Suman and I were discussing a particular data format problem that I
undertook to solve in DFDL. I thought it would be good to bring the
discussion to the whole group.

The problem is tagged data fields, where each value is preceded by a tag
naming the field. Many computer messaging formats represent some or all of
their data fields this way.

Here are two records containing a first and last name in a tagged format:
 
    fname:Tim!LName:Stewart; 
    LASTNAME:Smith!firstName:Tom;
 
Notes: the tags have varying forms; fname, firstname, firstName, FIRSTNAME,
and FNAME are all accepted as the tag for the first name field. The
definition here is that the tag is case insensitive and takes either the
fname or the firstname form. Similarly for lastname. Also, the tagged fields
can appear in any order and are optional.
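
To make the accepted-tag rule concrete, here is a minimal sketch in Python
(not DFDL) of a parser for records like these. It assumes, reading off the
two sample records only, that ':' ends a tag, '!' separates fields, and ';'
ends a record. The function name parse_records and the output keys are
hypothetical, chosen for illustration.

```python
import re

# Assumed tag rules: case insensitive, short or long form of each name.
TAGS = {
    "firstName": re.compile(r"f(?:irst)?name", re.IGNORECASE),
    "lastName": re.compile(r"l(?:ast)?name", re.IGNORECASE),
}


def parse_records(text):
    """Parse tagged records into a list of dicts (hypothetical helper)."""
    records = []
    for raw in text.split(";"):
        raw = raw.strip()
        if not raw:
            continue  # nothing but whitespace after the final ';'
        record = {}
        for field in raw.split("!"):
            tag, sep, value = field.partition(":")
            if not sep:
                raise ValueError(f"field without a tag: {field!r}")
            for name, pattern in TAGS.items():
                if pattern.fullmatch(tag.strip()):
                    record[name] = value
                    break
            else:
                raise ValueError(f"unknown tag: {tag!r}")
        records.append(record)
    return records


sample = "fname:Tim!LName:Stewart; \nLASTNAME:Smith!firstName:Tom;\n"
print(parse_records(sample))
```

Because each field is looked up by its tag, field order does not matter, and
a record that omits a field simply produces a dict without that key.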
 
Here's my test file showing how the XML comes out: (I've attached these as
files also in case the email system hammers them.)
 
This is testTaggedData1.xml

<?xml version="1.0" encoding="iso-8859-1"?>
<!-- Xerces-J fails if you put an internal DTD here so you can use Entity
defs. Too bad. -->
<dfdlTest xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="http://dataformat.org/testCase"
  xmlns:tc="http://dataformat.org/testCase"
  xsi:schemaLocation="http://dataformat.org/testCase ../../xsd/testCase.xsd
                      http://dataformat.org/tests testTaggedData1.dfdl.xsd">
  <inputTest>
    
    <!-- 
      Tagged data example from Suman Kalia of IBM.
      Each record consists of a first and last name. 
      Each name is tagged, so they can appear in either order. 
      Furthermore, the tagging scheme is too complex to implement with
      something simple like an initiator, though if we allowed initiators
      to be regexps that would be able to express this example. 
    -->
    
    <data kind="text">fname:Tim!LName:Stewart; 
LASTNAME:Smith!firstName:Tom;
</data>
    <dfdlSchema file="testTaggedData1.dfdl.xsd"/>
    <tc:xmlResult xmlns="http://dataformat.org/tests">
      <myData>
        <custInfo>
          <firstName>Tim</firstName>
          <lastName>Stewart</lastName>
        </custInfo>
        <custInfo>
          <firstName>Tom</firstName>
          <lastName>Smith</lastName>
        </custInfo>
      </myData>
    </tc:xmlResult>
  </inputTest>
</dfdlTest>

 
Now, the DFDL itself 
 
There are 3 variants here. I'll start with the simplest one. You get a very
simple DFDL schema for this if you assume you have (a) a way to specify the
values of initiators, terminators, and separators as regular expressions,
and (b) support for xsd:all groups.

This is testTaggedData3.dfdl.xsd

(ok, this one has long lines, so the email system is sure to hammer it, so I
won't inline it here.)

I think this particular example is pretty straightforward. 

However, I have two other example DFDL schemas for this which make fewer
assumptions.

testTaggedData2.dfdl.xsd still allows one to specify regular expressions for
the initiator rep property, but does not allow use of xsd:all, which is a
construct I *was* trying to avoid because, well, it's complicated and feels
non-primitive. I think you'll agree the complexity of the schema goes up
significantly without it.

testTaggedData1.dfdl.xsd eliminates specifying the initiator at all, and
specifies the tags by way of an additional field, hidden in a hidden layer,
whose value is constrained by an XSD pattern facet to match a specific
regular expression. It also does not use xsd:all.
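
One practical wrinkle with the pattern-facet approach: the XSD regular
expression dialect has no case-insensitivity flag, so a case-insensitive tag
has to be spelled out letter by letter. The Python sketch below generates
such a pattern. The helper names are hypothetical, and this is not
necessarily how the attached schema writes its pattern.

```python
def case_insensitive_pattern(word):
    """Expand an ASCII word into two-case character classes.

    For example, 'ab' becomes '[Aa][Bb]'. Hypothetical helper.
    """
    if not (word.isascii() and word.isalpha()):
        raise ValueError(f"expected ASCII letters only: {word!r}")
    return "".join(f"[{ch.upper()}{ch.lower()}]" for ch in word)


def tag_pattern(*forms):
    """Join the expanded forms with '|' so that any one of them matches."""
    return "|".join(case_insensitive_pattern(form) for form in forms)


print(tag_pattern("fname", "firstname"))
# [Ff][Nn][Aa][Mm][Ee]|[Ff][Ii][Rr][Ss][Tt][Nn][Aa][Mm][Ee]
```

A pattern facet must match the whole value, so the alternation needs no
explicit anchors when used there.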

My summary from going through this exercise: We need both xsd:all support
and regular expressions for initiators and all other delimiters. I'm happy
that one can express these things without needing these constructs, but
tagged representations are too commonplace for this much complex
construction to be required. The complex constructions I used in
testTaggedData1 and testTaggedData2 would only be needed if the tags were
complex formatted entities whose format couldn't be handled by a regular
expression.

Note that if the tags are actually not case insensitive, but are really
fixed strings, then there is no need for the regular expression capability.
I'm not sure where we should draw the line here. I'm comfortable with
xsd:all support, and with either plain strings or regexps as delimiters.

...mikeb





-------------- next part --------------
A non-text attachment was scrubbed...
Name: testTaggedData1.xml
Type: application/octet-stream
Size: 1334 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20050407/320f2cf3/attachment.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testTaggedData3.dfdl.xsd
Type: application/octet-stream
Size: 3019 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20050407/320f2cf3/attachment-0001.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testTaggedData2.dfdl.xsd
Type: application/octet-stream
Size: 4183 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20050407/320f2cf3/attachment-0002.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testTaggedData1.dfdl.xsd
Type: application/octet-stream
Size: 6766 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20050407/320f2cf3/attachment-0003.obj 

