[dfdl-wg] tagged data examples

Steve Hanson smh at uk.ibm.com
Thu Apr 7 11:56:18 CDT 2005





Very good questions. Here are my thoughts, based on experience with the
parser we use with Message Broker.

Whenever initiators (I will call them tags, as it's less typing :) are used,
that means the order of fields can vary and fields can be omitted. Not
surprisingly, customers exploit this. Any parser claiming to support the
use of text tags in data must therefore support unordered data (i.e.,
xsd:all) and missing data. No different from XML instance documents, really.

Our parser lets you specify a single fixed (case-sensitive) tag for a
field. If a tag could vary in case, or have an alternative form, as your
example shows, we would fall back to using a regular expression. But in our
case everything matched by the regular expression is treated as data. That
is not what you want in this scenario: the tag ends up in the data, and
anything subsequently processing the data must strip it off. The way round
this is, as you say, to allow just the initiator to be specified using a
regular expression. However, we have not received an explicit requirement
for this (yet).
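
A minimal Python sketch of the difference between the two behaviours. The pattern and function names are illustrative assumptions, not anything taken from the broker parser:

```python
import re

# Hypothetical pattern for the first-name tag: "fname:" or "firstname:", any case.
TAG = r"f(?:irst)?name:"
# Assumed value syntax: anything up to the next '!' or ';'.
VALUE = r"[^!;]*"

def whole_match_as_data(field):
    """Regex covers tag and value; everything matched is returned as data."""
    m = re.fullmatch(TAG + VALUE, field, re.IGNORECASE)
    return m.group(0) if m else None

def regex_as_initiator(field):
    """Regex covers only the tag; the data is whatever follows it."""
    m = re.match(TAG, field, re.IGNORECASE)
    return field[m.end():] if m else None

print(whole_match_as_data("firstName:Tom"))  # tag is still part of the data
print(regex_as_initiator("firstName:Tom"))   # tag has been consumed
```

In the first function the caller still has to strip the tag; in the second the tag is consumed as an initiator and only the value is returned.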

I wasn't sure how to read the 'or' in your last sentence. Personally, for
DFDL 1.0 I think that xsd:all support is a must, but that we could probably
get away with a single fixed string for a tag, perhaps accompanied by a
'case sensitive' property. However, regular expression support in general
is required in order to distinguish data where there is no tag - you can't
parse a SWIFT message without it, for example. So maybe allowing a tag to
be specified with a regular expression is not a big deal, and we should
include it in 1.0 anyway.

Final thought on modeling your example. If you know that you will always
get either fname & lname, or firstname & lastname, then you could model
this as an xsd:choice of two xsd:all groups, where each group contains the
same child xsd:elements but with different (fixed) dfdl tags. No regular
expression is needed. Obviously this does not scale well, and many users
do not like having to add extra 'layers' to their models in this way.
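
The choice-of-two-all-groups idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the GROUPS table, the function name, and the '!' and ';' delimiters (borrowed from the example records) are hypothetical, and the tags are matched as fixed, case-sensitive strings:

```python
# Two hypothetical "all" groups: same element names, different fixed tags.
GROUPS = [
    {"fname": "firstName", "lname": "lastName"},
    {"firstname": "firstName", "lastname": "lastName"},
]

def parse_record(record):
    """Try each group in turn (the choice); within a group, fields may come
    in any order and may be absent (the all)."""
    fields = [f for f in record.rstrip(";").split("!") if f]
    for tags in GROUPS:
        result = {}
        for field in fields:
            tag, sep, value = field.partition(":")
            if not sep or tag not in tags or tags[tag] in result:
                break  # unknown or repeated tag: this group does not fit
            result[tags[tag]] = value
        else:
            return result  # every field fitted this group
    raise ValueError("no group matches record: " + record)

print(parse_record("lastname:Smith!firstname:Tom;"))
```

Every additional tag spelling needs another entry in GROUPS, and a record that mixes spellings (fname with lastname) is rejected, which is the scaling problem in miniature.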

Regards, Steve

Steve Hanson
WebSphere Business Integration Brokers,
IBM Hursley, England
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848


                                                                           
mike.beckerle at ascentialsoftware.com
Sent by: owner-dfdl-wg at ggf.org
07/04/2005 15:26

To: dfdl-wg at gridforum.org
cc:
Subject: [dfdl-wg] tagged data examples
                                                                           
                                                                           
                                                                           
                                                                           
                                                                           





DFDLers,

Suman and I were discussing a particular data format problem that I
undertook to solve in DFDL. I thought it would be good to bring the
discussion to the whole group.

The problem is tagged data fields. Many computer messaging formats use
things like this for some or all of their data fields.

Here are two records containing a first and last name in a tagged format:

    fname:Tim!LName:Stewart;
    LASTNAME:Smith!firstName:Tom;

Notes: the tags have varying forms: fname, firstname, firstName,
FIRSTNAME, and FNAME are all accepted as the tag for the first name field.
The definition here is that the tag is case insensitive and takes either
the fname or the firstname form. Similarly for lastname. Also, the tagged
fields can appear in any order and are optional.
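
As a rough illustration of these rules, here is a minimal Python sketch that parses such records directly. It is not DFDL and not any group member's implementation; the INITIATORS table and the function name are assumptions made for the example, with ';' taken as the record terminator and '!' as the field separator, as in the records above:

```python
import re

# Assumed tag patterns: case-insensitive, short or long form of each tag.
INITIATORS = {
    "firstName": re.compile(r"f(?:irst)?name:", re.IGNORECASE),
    "lastName": re.compile(r"l(?:ast)?name:", re.IGNORECASE),
}

def parse_records(text):
    """Return one dict per ';'-terminated record. Fields are '!'-separated,
    may appear in any order, and may be missing."""
    records = []
    for raw in text.split(";"):
        raw = raw.strip()
        if not raw:
            continue  # skip the empty remainder after the last ';'
        record = {}
        for field in raw.split("!"):
            for name, initiator in INITIATORS.items():
                m = initiator.match(field)
                if m:
                    record[name] = field[m.end():]  # value follows the tag
                    break
            else:
                raise ValueError("unrecognised tag in field: " + field)
        records.append(record)
    return records

print(parse_records("fname:Tim!LName:Stewart;\nLASTNAME:Smith!firstName:Tom;\n"))
```

One simplification to be aware of: a record in which both fields are omitted is indistinguishable here from trailing whitespace and is skipped.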

Here's my test file showing how the XML comes out: (I've attached these as
files also in case the email system hammers them.)

This is testTaggedData1.xml

<?xml version="1.0" encoding="iso-8859-1"?>
<!-- Xerces-J fails if you put an internal DTD here so you can use Entity
defs. Too bad. -->
<dfdlTest xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="http://dataformat.org/testCase"
  xmlns:tc="http://dataformat.org/testCase"
  xsi:schemaLocation="http://dataformat.org/testCase ../../xsd/testCase.xsd
                      http://dataformat.org/tests testTaggedData1.dfdl.xsd">
  <inputTest>

    <!-- Tagged data example from Suman Kalia of IBM.
         Each record consists of a first and last name.
         Each name is tagged, so they can appear in either order.
         Furthermore, the tagging scheme is too complex to implement with
         something simple like an initiator, though if we allowed initiators
         to be regexps that would be able to express this example. -->

    <data kind="text">fname:Tim!LName:Stewart;
LASTNAME:Smith!firstName:Tom;
</data>
    <dfdlSchema file="testTaggedData1.dfdl.xsd"/>
    <tc:xmlResult xmlns="http://dataformat.org/tests">
      <myData>
        <custInfo>
          <firstName>Tim</firstName>
          <lastName>Stewart</lastName>
        </custInfo>
        <custInfo>
          <firstName>Tom</firstName>
          <lastName>Smith</lastName>
        </custInfo>
      </myData>
    </tc:xmlResult>
  </inputTest>
</dfdlTest>


Now, the DFDL itself

There are 3 variants here. I'll start with the simplest one. You get a very
simple DFDL for this if you assume you can have (a) a way to specify the
values of initiators, terminators, and separators as regular expressions,
and (b) support for xsd:all groups.

This is testTaggedData3.dfdl.xsd

(OK, this one has long lines, so the email system is sure to hammer it, so
I won't inline it here.)

I think this particular example is pretty straightforward.

However, I have two other example DFDL schemas for this which make fewer
assumptions.

testTaggedData2.dfdl.xsd still allows one to specify regular expressions
for the initiator rep property, but does not use xsd:all, which is a
construct I *was* trying to avoid because, well, it's complicated and feels
non-primitive. I think you'll agree the complexity goes up significantly.

testTaggedData1.dfdl.xsd eliminates specifying the initiator at all, and
specifies the tags by way of an additional field hidden in a hidden-layer
which has value constrained by an XSD pattern facet to match a specific
regular expression. It also does not use xsd:all.

My summary from going through this exercise: We need both xsd:all support
and regular expressions for initiators and all other delimiters. I'm happy
that one can express these things without needing these constructs, but
tagged representations are too commonplace for this much complex
construction to be required. The complex constructions I used in
testTaggedData1 and testTaggedData2 would only be needed if the tags were
complex formatted entities whose format couldn't be handled by a regular
expression.

Note that if the tags are actually not case insensitive, but are really
fixed strings, then there is no need for the regular expression capability.
I'm not sure where we should draw the line here. I'm comfortable with
xsd:all support, and with either plain strings or regexps as delimiters.

...mikeb





(See attached file: testTaggedData1.xml)(See attached file:
testTaggedData3.dfdl.xsd)(See attached file: testTaggedData2.dfdl.xsd)(See
attached file: testTaggedData1.dfdl.xsd)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testTaggedData1.xml
Type: application/octet-stream
Size: 1367 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20050407/12910e93/attachment.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testTaggedData3.dfdl.xsd
Type: application/octet-stream
Size: 3095 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20050407/12910e93/attachment-0001.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testTaggedData2.dfdl.xsd
Type: application/octet-stream
Size: 4296 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20050407/12910e93/attachment-0002.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testTaggedData1.dfdl.xsd
Type: application/octet-stream
Size: 6964 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/dfdl-wg/attachments/20050407/12910e93/attachment-0003.obj 

