[DFDL-WG] Can I ignore data I don't want in DFDL?

Steve Hanson smh at uk.ibm.com
Fri Mar 1 09:02:39 EST 2013


The general solution in DFDL is to put an optional, repeating element 
inside a hidden group.

You need to be careful that this optional hidden element does not consume 
the next piece of wanted data by mistake. If all the unwanted elements 
have known initiators, then you are OK. If you don't know the initiators 
but do know what is coming next, then one approach is as follows:

<xs:complexType>
        <xs:sequence>
                <xs:element name="From" type="NameType"
                        dfdl:initiator="From:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
                <xs:element name="To" type="NameType"
                        dfdl:initiator="To:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
                <!-- Unwanted header lines are parsed here but kept out of the infoset -->
                <xs:sequence dfdl:hiddenGroupRef="UnwantedGroup"/>
                <xs:element name="Subject" type="xs:string"
                        dfdl:initiator="Subject:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
        </xs:sequence>
</xs:complexType>

<xs:group name="UnwantedGroup">
        <xs:sequence>
                <xs:element name="UnwantedHeaders" minOccurs="0" maxOccurs="unbounded">
                        <xs:complexType>
                                <xs:sequence>
                                        <xs:element name="Unwanted" type="xs:string"
                                                dfdl:terminator="%NL;%WSP*;">
                                                <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
                                                        <!-- Reject this occurrence if the line is the wanted Subject header -->
                                                        <dfdl:discriminator test="{fn:not(fn:starts-with(., 'Subject:'))}"/>
                                                </xs:appinfo></xs:annotation>
                                        </xs:element>
                                </xs:sequence>
                        </xs:complexType>
                </xs:element>
        </xs:sequence>
</xs:group>

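The hidden loop should consume all header lines that do not start with 
"Subject:" and stop when it reaches one that does: at that point the 
discriminator fails, the parser backtracks, and the line is left for the 
Subject element.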

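For the simpler case mentioned earlier, where the unwanted headers have 
known initiators, no discriminator should be needed. The following is only 
a sketch under assumptions: the group name "KnownUnwantedGroup" is made up 
for illustration, and "Keywords:" is taken from the sample data in the 
question as the one header to drop.

```xml
<!-- Hypothetical variant: the unwanted header has a known initiator -->
<xs:group name="KnownUnwantedGroup">
        <xs:sequence>
                <!-- Matches zero or more "Keywords:" lines; any other line
                     fails the initiator check and is left for the next element -->
                <xs:element name="Keywords" type="xs:string"
                        minOccurs="0" maxOccurs="unbounded"
                        dfdl:initiator="Keywords:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
        </xs:sequence>
</xs:group>
```

It would be referenced in the same way as above, via 
<xs:sequence dfdl:hiddenGroupRef="KnownUnwantedGroup"/>.
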
I've used a terminator for the header lines; you could instead use a 
separator with separatorPolicy 'suppressed'. Either should work, but the 
terminator gives you the opportunity to handle data where the final CRLF 
is missing (via property dfdl:documentFinalTerminatorCanBeMissing).
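
As a rough sketch of that last point, the property can be set at the 
format level. This fragment is illustrative only: it assumes the attribute 
is merged into whatever top-level dfdl:format your schema already has, and 
you should check that your DFDL processor supports the property.

```xml
<xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
                <!-- Assumption: added alongside your existing format properties -->
                <dfdl:format documentFinalTerminatorCanBeMissing="yes"/>
        </xs:appinfo>
</xs:annotation>
```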

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848



From:   "Garriss Jr., James P." <jgarriss at mitre.org>
To:     "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>
Date:   01/03/2013 13:15
Subject:        [DFDL-WG] Can I ignore data I don't want in DFDL?
Sent by:        dfdl-wg-bounces at ogf.org



Suppose I am using DFDL to parse email headers.  Suppose the RFC only 
allows 3 headers:  To, From, Subject.  DFDL can handle this, no problem.
 
But suppose I get an email that includes a 4th header, one I have not 
planned for (i.e., have not included in the DFDL schema), don’t care 
about, and don’t want in the infoset.  Like so:
 
From: <john at doe.com>
To: <jane at doe.com>
Keywords:  sales                       <-- this line should be ignored!
Subject:  Latest sales figures
 
Can DFDL handle this?  Does it have a mechanism for allowing me to ignore 
(and thus drop) data I haven’t planned for and don’t care about?