[DFDL-WG] Ignore extraneous CRLF w/ space?

Steve Hanson smh at uk.ibm.com
Fri Jun 7 10:14:31 EDT 2013


I've been trying to work out a way to do this in a single pass, but so far 
no luck. I think your conclusion is correct and it requires two passes.

The ability to handle formats like this is a goal of DFDL; it just didn't 
make the cut for version 1.0 of the spec. 

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848



From:   "Garriss Jr., James P." <jgarriss at mitre.org>
To:     "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, 
Date:   07/06/2013 13:18
Subject:        Re: [DFDL-WG] Ignore extraneous CRLF w/ space?
Sent by:        dfdl-wg-bounces at ogf.org



I have received no response to this.  :-(
 
Should I take this to mean that folding whitespace (FWS) in email headers 
is a problem that DFDL cannot handle?  I think this is a reasonable 
conclusion, as FWS is conceptually identical to IMF comments, which we 
have also identified as a problem that DFDL cannot handle.  I’m not trying 
to cast stones at DFDL, just trying to understand its limitations.
 
The solution for these two problems (and for encoded words as well) is for 
DFDL to support multiple passes over some (or maybe all) of the data.  Is 
this a feature that is being considered for DFDL?
 
From: Garriss Jr., James P. 
Sent: Wednesday, June 05, 2013 1:23 PM
To: dfdl-wg at ogf.org
Subject: RE: [DFDL-WG] Ignore extraneous CRLF w/ space?
 
Ok, so the good news is that I completely understand what you’re talking 
about now.  Thanks for the example and explanation (with correction).
 
The bad news is that I don’t see how this helps me.  IOW, I now have an 
“array” of “data” elements, but I need to validate the actual data.  I 
should be breaking the header into “from” data, “by” data, “via” data, 
etc., with “date” data at the end.  Something kinda like this:
 
  <Received>
    <Tokens>
      <From>smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1])</From>
      <By>localhost (Postfix)</By>
      <Via>Exchange Front-End Server webmail.afmc.af.mil ([131.28.34.85])</Via>
      <With>SMTP</With>
      <Id>0A8791F116E</Id>
      <For>&lt;jgarriss at mitre.org&gt;</For>
    </Tokens>
    <DateTime>
      <DateTimeStuff>
        <DayOfTheWeek>Tue</DayOfTheWeek>
                          …more day time stuff here…
      </DateTimeStuff>
    </DateTime>
  </Received>
 
I think we are saying this is another 2-pass problem.  In other words, the 
data is this (only 1 CRLF at the end):
 
Received: from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1]) by 
localhost (Postfix) via Exchange Front-End Server webmail.afmc.af.mil 
([131.28.34.85]) with SMTP id 0A8791F116E for <jgarriss at mitre.org>; Tue, 4 
Jun 2013 14:03:24 -0400 (EDT)
 
But IMF adds what it calls “folding whitespace” to break it into multiple 
lines (see http://tools.ietf.org/html/rfc5322#section-3.2.2), like this:
 
Received: from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1]) 
 by localhost (Postfix) via Exchange Front-End Server webmail.afmc.af.mil 
 ([131.28.34.85]) with SMTP id 0A8791F116E for <jgarriss at mitre.org>; Tue,
  4 Jun 2013 14:03:24 -0400 (EDT)
 
So DFDL needs 2 passes to correctly parse the data, one to remove the 
CRLFs and another to parse/validate the data.  (If you recall, Mike, we’ve 
found that DFDL has the same problem with IMF comments and encoded words.) 
 And DFDL can’t do multiple passes.
 
That right?
 
If that’s right, this is a serious problem, because *many* headers use 
folding whitespace (though Received is probably the most important one). 
It’s a pretty core concept for IMF. 
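
For illustration only, here is a minimal Python sketch (not DFDL) of the 
two passes described above. Pass 1 unfolds the header the way RFC 5322 
section 2.2.3 describes, by removing any CRLF that is immediately followed 
by whitespace. Pass 2 then splits the unfolded Received line at the last 
";" into a token part and a date part. The function names unfold and 
split_received are made up for this example, and splitting at the last ";" 
is a simplifying assumption, not a full Received grammar.

```python
import re


def unfold(raw):
    """Pass 1: drop every CRLF that is immediately followed by SP or TAB.

    The whitespace after the CRLF is kept, as in RFC 5322 unfolding.
    """
    return re.sub(r"\r\n(?=[ \t])", "", raw)


def split_received(line):
    """Pass 2: split an unfolded Received header into (tokens, date).

    Assumption for this sketch: the date is whatever follows the last ';'.
    """
    name, _, body = line.partition(":")
    if name.strip().lower() != "received":
        raise ValueError("not a Received header")
    tokens, _, date = body.rpartition(";")
    return tokens.strip(), date.strip()


folded = (
    "Received: from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1])\r\n"
    " by localhost (Postfix) via Exchange Front-End Server webmail.afmc.af.mil\r\n"
    " ([131.28.34.85]) with SMTP id 0A8791F116E for <jgarriss at mitre.org>; Tue,\r\n"
    "  4 Jun 2013 14:03:24 -0400 (EDT)\r\n"
)

unfolded = unfold(folded)
tokens, date = split_received(unfolded.rstrip("\r\n"))
```

After pass 1 only the final CRLF remains, so a single-pass parser with a 
CRLF terminator could then be applied to the unfolded text.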
 
 
From: Steve Hanson [mailto:smh at uk.ibm.com] 
Sent: Wednesday, June 05, 2013 12:44 PM
To: Garriss Jr., James P.
Cc: dfdl-wg at ogf.org; dfdl-wg-bounces at ogf.org
Subject: Re: [DFDL-WG] Ignore extraneous CRLF w/ space?
 
Received: from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1]) 
 by localhost (Postfix) via Exchange Front-End Server webmail.afmc.af.mil 
 ([131.28.34.85]) with SMTP id 0A8791F116E for <jgarriss at mitre.org>; Tue, 
 4 Jun 2013 14:03:24 -0400 (EDT) 

<xs:element name="Received_Header" dfdl:initiator="Received:%WSP*;" 
            dfdl:terminator="%CR;%LF;"> 
  <xs:complexType> 
    <xs:sequence dfdl:separator="%CR;%LF;%SP;" 
                 dfdl:separatorPosition="infix"> 
      <xs:element name="data" type="xs:string" maxOccurs="unbounded" 
                  dfdl:lengthKind="delimited" /> 
    </xs:sequence> 
  </xs:complexType> 
</xs:element> 

DFDL consumes the initiator, then processes the content of the 
header as an array of records. Each CR+LF+SP is consumed as the separator, 
because that is the longest match. The CR+LF with no SP after it is consumed 
as the terminator of the header. Clearly that only works if the first line 
of the next header does not start with SP straight after that CR+LF. So you 
don't need a discriminator. 
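
To make the longest-match behaviour concrete, here is a small Python 
simulation. It is a sketch of the idea only, not how a DFDL processor is 
implemented. The name scan_header and the two constants are invented for 
this example; they stand in for the separator and terminator of the schema 
above, and the initiator is assumed to have been consumed already.

```python
SEPARATOR = "\r\n "   # stands in for %CR;%LF;%SP;
TERMINATOR = "\r\n"   # stands in for %CR;%LF;


def scan_header(data, start=0):
    """Return (items, end_index) for one header body.

    At each position the longer delimiter (the separator) is tried first,
    which models longest-match delimiter scanning.
    """
    items = []
    item_start = start
    i = start
    while i < len(data):
        if data.startswith(SEPARATOR, i):
            # CRLF followed by SP: end of one "data" item, more to come.
            items.append(data[item_start:i])
            i += len(SEPARATOR)
            item_start = i
        elif data.startswith(TERMINATOR, i):
            # CRLF not followed by SP: end of the whole header.
            items.append(data[item_start:i])
            return items, i + len(TERMINATOR)
        else:
            i += 1
    raise ValueError("terminator not found")


sample = "from a\r\n by b\r\n  4 Jun\r\nSubject: x\r\n"
items, end = scan_header(sample)
rest = sample[end:]
```

Only one SP is consumed with each separator, so a continuation line that 
was folded with two spaces keeps one of them at the start of its item.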

You will have to stitch the data together post-parse. I guess you could 
make the sequence hidden and get DFDL to stitch together the data lines 
into one long string via an element with dfdl:inputValueCalc.   
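
As a rough sketch of that post-parse stitching step, in Python rather than 
in a DFDL expression: because the separator consumed CR+LF plus one SP, 
while RFC 5322 unfolding removes only the CRLF, one space has to be put 
back between adjacent items. The function name stitch and the sample list 
are hypothetical.

```python
def stitch(parts):
    """Rejoin the parsed "data" items into one unfolded string.

    Each separator swallowed CRLF + one SP, so a single space is
    re-inserted between items to match the RFC 5322 unfolded form.
    """
    return " ".join(parts)


# Items as they would come out of the separator-based parse of
# "from a\r\n by b\r\n  4 Jun" (the last line was folded with two spaces).
parsed_items = ["from a", "by b", " 4 Jun"]
stitched = stitch(parsed_items)
```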

Ah - I think I see where Mike's earlier post to the mailing list was 
coming from? 

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 



From:        "Garriss Jr., James P." <jgarriss at mitre.org> 
To:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, 
Date:        05/06/2013 16:25 
Subject:        Re: [DFDL-WG] Ignore extraneous CRLF w/ space? 
Sent by:        dfdl-wg-bounces at ogf.org 




> Is the problem that the dfdl:terminator '%CR;%LF;' for the end of the 
header record is firing prematurely when it encounters the CRLF in the 
data? 
  
Exactly. 
  
> I would model the data as unbounded repeating records, and use a 
discriminator to distinguish the repeats from the next header. 
  
Uh, could you repeat that in English?  Maybe with a small example?  I 
freely admit that I don’t understand what you just said.  Thanks! 
  
From: Steve Hanson [mailto:smh at uk.ibm.com] 
Sent: Wednesday, June 05, 2013 5:21 AM
To: Garriss Jr., James P.
Cc: dfdl-wg at ogf.org; dfdl-wg-bounces at ogf.org
Subject: Re: [DFDL-WG] Ignore extraneous CRLF w/ space? 
  
James 

Is the problem that the dfdl:terminator '%CR;%LF;' for the end of the 
header record is firing prematurely when it encounters the CRLF in the 
data? 

If so then I'm not sure that DFDL can ignore the extra %CR;%LF; without 
using an escape scheme - but there isn't an escape scheme to use. 

I would model the data as unbounded repeating records, and use a 
discriminator to distinguish the repeats from the next header. 

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 



From:        "Garriss Jr., James P." <jgarriss at mitre.org> 
To:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, 
Date:        04/06/2013 19:56 
Subject:        [DFDL-WG] Ignore extraneous CRLF w/ space? 
Sent by:        dfdl-wg-bounces at ogf.org 





Long IMF headers, such as Received, can be wrapped onto the next line by 
using a CRLF and then a space.  This example has 3 such wrappings: 
 
Received: from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1]) 
 by localhost (Postfix) via Exchange Front-End Server webmail.afmc.af.mil 
 ([131.28.34.85]) with SMTP id 0A8791F116E for <jgarriss at mitre.org>; Tue, 
 4 Jun 2013 14:03:24 -0400 (EDT) 
 
How do I get DFDL to ignore these wrappings?  For most of the header, it’s 
not an issue, because I can use a lengthPattern to lookahead to the ; 
before the date starts.  But once the date starts, I have no way of 
knowing when it ends, so I need to ignore any CRLF with a space. 
 
TIA 
 
--
dfdl-wg mailing list
dfdl-wg at ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg 

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
