[DFDL-WG] terminate by next field's initiator aka lengthKind="endAtStartOfNext" or something like that
Steve Hanson
smh at uk.ibm.com
Tue Jun 4 11:57:40 EDT 2013
Try modelling the < as an 'infix' separator, with suppression policy
'anyEmpty', which allows it to be absent when the preceding field is
empty. Model the > as part of the terminator of the record, so the
terminator is the list '%NL; >%NL;'. Speculative parsing should then sort
out lines 1 and 2, I hope.
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl at gmail.com>,
Cc: dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org
Date: 04/06/2013 15:02
Subject: Re: [DFDL-WG] terminate by next field's initiator aka
lengthKind="endAtStartOfNext" or something like that
Microsoft's RTF example:
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0
Calibri;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard
\sa200\sl276\slmult1\lang9\f0\fs22\par
Line 1: xxxx\par
\b Line 2:\b0 yyyy\par
}
Elements are delimited by the \ (either initiator or prefix separator) of
simple fields, or by the { (initiator) of complex fields.
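
To make the "ends where the next one starts" idea concrete, here is a rough Python sketch that cuts one line of the RTF sample at each \ or {, keeping the delimiter at the front of the element it starts. The names TOKEN and split_rtf_elements are invented for illustration. The sketch ignores closing braces and RTF escapes such as \{, so it is not a real RTF tokenizer.

```python
import re

# Hypothetical pattern: each element begins with '\' or '{' and runs up to,
# but not including, the next '\' or '{'.
TOKEN = re.compile(r'[\\{][^\\{]*')

def split_rtf_elements(text):
    """Return the elements of text, each still carrying its leading delimiter."""
    return TOKEN.findall(text)

if __name__ == "__main__":
    for element in split_rtf_elements(r'\b Line 2:\b0 yyyy\par'):
        print(repr(element))
```

Each element has no terminator of its own here; the start of the following element is what ends it.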
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: dfdl-wg at ogf.org,
Date: 04/06/2013 14:47
Subject: [DFDL-WG] terminate by next field's initiator aka
lengthKind="endAtStartOfNext" or something like that
Sent by: dfdl-wg-bounces at ogf.org
I know we omitted this from DFDL v1.0 (I am quite sure I advocated that
position), and we're too late to add it back now. While it was
theoretically possible, I had never actually seen it in real data. Now I
have, and I'm wondering whether it is more common than I originally thought.
The situation is this. I have an element. It wants to be delimited, in that
it has an escape scheme and it is delimited by something in the common
sense of the word, but the terminator is actually what one thinks of as
the initiator of the next element.
It comes up in Internet Message Format headers as one example:
Reply-To: joe at foo.com
Reply-To: <joe at foo.com>
Reply-To: joe smith<joe at foo.com>
Reply-To: "joe <Mr. XML> smith"<joe at foo.com>
Reply-To: <>
In the third and fourth cases, there is no terminator, just the required <
which begins the next field.
Modeling this whole reply-to construct requires a choice of several
different elements which model the different formats. For example, I see no
way to model a format which accepts either line 1 or line 2 of the above
without using a choice. That said, my real concern is with lines 3 and 4.
The natural model for lines 3 and 4 (and perhaps 5) seems like it should
be a display-name field followed by an email address field. The "<" really
does not want to be used in some situations as the terminator of the prior
field and in others as the initiator of the next field. That affects reuse
of the validation regexes, etc.
Right now the only way to model this is for the display name field to use
a regex which re-invents the escape-scheme-like behavior of the optional
quotation mark surround, and uses regex lookahead to sense the "<" when it
appears unescaped, without consuming it.
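
A minimal sketch of that kind of regex, written in Python rather than as a DFDL pattern. The names DISPLAY_NAME and split_display_name are hypothetical, and the sketch assumes that a backslash escapes the next character inside the quotes. The lookahead (?=<) checks for the "<" without consuming it, so the "<" is still available to the address field that follows.

```python
import re

# Hypothetical pattern (not taken from any schema): either a quoted string,
# in which a backslash escapes the next character (an assumption), or an
# unquoted run containing no '<' or '"'. The lookahead requires a '<' to
# follow but leaves it unconsumed.
DISPLAY_NAME = re.compile(r'(?:"(?:[^"\\]|\\.)*"|[^<"]*)(?=<)')

def split_display_name(value):
    """Return (display_name, rest), or None when no unescaped '<' follows."""
    m = DISPLAY_NAME.match(value)
    if m is None:
        return None
    return m.group(0), value[m.end():]

if __name__ == "__main__":
    samples = [
        'joe at foo.com',
        '<joe at foo.com>',
        'joe smith<joe at foo.com>',
        '"joe <Mr. XML> smith"<joe at foo.com>',
        '<>',
    ]
    for sample in samples:
        print(repr(sample), '->', split_display_name(sample))
```

The first sample has no "<" at all, so the pattern does not match, which is the case that still needs a separate branch of the choice.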
That's not too bad really, but I am curious what others have seen out
there in the world of data that also has this idiom where a string is
delimited by a unique structure at the beginning of the next element.
Do we have collective knowledge of several more such formats, or have we
all just seen this same IMF header example as the motivation?
--
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
--
dfdl-wg mailing list
dfdl-wg at ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU