[DFDL-WG] terminate by next field's initiator aka lengthKind="endAtStartOfNext" or something like that

Tue Jun 4 10:02:11 EDT 2013

Microsoft's RTF example:

{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 
Calibri;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard
\sa200\sl276\slmult1\lang9\f0\fs22\par
Line 1: xxxx\par
\b Line 2:\b0  yyyy\par
}

Elements are delimited by the \ (either initiator or prefix separator) of 
simple fields, or by the { (initiator) of complex fields.
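As a rough illustration of that rule (a sketch only, not a real RTF parser), the snippet below cuts a run of RTF text into fields, each of which ends where the next field's initiator begins. The function name split_fields is hypothetical, treating } as a boundary is an added assumption, and RTF's escaped characters (\\, \{, \}) are deliberately ignored.

```python
import re

# Each field starts at an initiator character ('\', '{' or '}') and runs
# until the next such character, which is left for the following field.
# Assumption: escaped characters such as \\ \{ \} are not handled here.
FIELD = re.compile(r'[\\{}][^\\{}]*')

def split_fields(text):
    """Return the fields of `text`, each terminated by the next initiator."""
    return FIELD.findall(text)

if __name__ == "__main__":
    for field in split_fields(r'\b Line 2:\b0  yyyy\par'):
        print(repr(field))
```

No field carries a terminator of its own; the match simply stops short of the character that opens the next field.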

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848

From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     dfdl-wg at ogf.org, 
Date:   04/06/2013 14:47
Subject:        [DFDL-WG] terminate by next field's initiator aka 
lengthKind="endAtStartOfNext" or something like that
Sent by:        dfdl-wg-bounces at ogf.org

I know we omitted this from DFDL v1.0 (I am quite sure I advocated that 
position), and it is too late to add it back now. It was always 
theoretically possible, but I had never actually seen it in real data. 
Now I have, and I'm wondering whether it is more common than I originally 
thought.

The situation is this. I have an element that wants to be treated as 
delimited: it has an escape scheme, and it is delimited by something in 
the common sense of the word. But its terminator is actually what one 
thinks of as the initiator of the next element.

It comes up in Internet Message Format headers as one example:

Reply-To: joe@foo.com
Reply-To: <joe@foo.com>
Reply-To: joe smith<joe@foo.com>
Reply-To: "joe <Mr. XML> smith"<joe@foo.com>
Reply-To: <>

In the third and fourth cases, the display name has no terminator of its 
own, just the required < which begins the next field.

Modeling this whole Reply-To construct requires a choice of several 
different elements which model the different formats. For example, I see 
no way to model a format which accepts either line 1 or line 2 of the 
above without using a choice. That said, my real concern is with lines 3 
and 4.

The natural model for lines 3 and 4 (and perhaps 5) seems like it should 
be a display-name field followed by an email address field. The "<" 
should not have to act as the terminator of the prior field in some 
situations and as the initiator of the next field in others. That hurts 
reuse of the validation regexes, etc.

Right now the only way to model this is for the display name field to use 
a regex which re-invents the escape-scheme-like behavior of the optional 
quotation mark surround, and uses regex lookahead to sense the "<" when it 
appears unescaped, without consuming it.  
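A minimal sketch of that regex workaround, in Python rather than in a DFDL schema. The patterns and the name parse_reply_to are hypothetical, and the quoting rules are simplified relative to the real Internet Message Format grammar. The display name is either a double-quoted string or a run of characters containing no < or ", and a lookahead senses the < without consuming it.

```python
import re

# Hypothetical patterns, for illustration only.
# The display name is either a quoted string (backslash escapes allowed
# inside the quotes) or a run of characters with no '<' or '"'.
# The lookahead (?=<) requires the '<' but leaves it unconsumed.
DISPLAY_NAME = re.compile(r'(?:"(?:[^"\\]|\\.)*"|[^<"]*)(?=<)')
ADDRESS = re.compile(r'<([^>]*)>')

def parse_reply_to(value):
    """Return (display_name, address), or None if there is no <...> part."""
    name_match = DISPLAY_NAME.match(value)
    if name_match is None:
        return None
    # The address field starts exactly where the display name stopped,
    # so the '<' is still available to act as its initiator.
    addr_match = ADDRESS.match(value, name_match.end())
    if addr_match is None:
        return None
    return name_match.group(0), addr_match.group(1)

if __name__ == "__main__":
    print(parse_reply_to('"joe <Mr. XML> smith"<joe@foo.com>'))
```

Because the < is never consumed by the display-name pattern, the address pattern can be reused unchanged for the forms that have no display name at all.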

That's not too bad really, but I am curious what others have seen out 
there in the world of data that also has this idiom, where a string is 
delimited by a unique structure at the beginning of the next element. 

Do we collectively know of several more such formats, or have we all just 
seen this same IMF header example as the motivation?

-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com
