[DFDL-WG] DFDL Regular Expression proposal
Steve Hanson
smh at uk.ibm.com
Wed Apr 9 07:58:49 CDT 2008
Comments from Steve and Ian:
1) The subset proposed is basically lifted from the IBM MRM parser help.
If I ever knew what the rationale for the subset was, I don't know it now.
What features have we excluded?
2) IBM MRM parser has extended the xsd regular expression syntax to allow
hexadecimal characters using the following syntax:
\xNN
where each N is a hexadecimal digit in the range 0 to F.
MRM makes much wider use of regular expressions, as an alternative to
speculative parsing, so I can see why MRM needed this (one concrete use
case was for TLOG retail messages). Do we need to support this in DFDL?
3) If we don't add the hex support, what are the use cases for using a
dfdl:lengthPattern versus using an xsd pattern facet? It looks like
pattern facets apply to all supported schema simple types, so not clear
why dfdl:lengthPattern would be needed. The only use case I can think of
is where we have length on a complex element or sequence or choice. If
this is the only use case perhaps dfdl:lengthPattern should only be used
in those cases? MRM allows this use. (It might also answer 2 as it allows
embedded binary data to appear). Or is there a distinction between
validation and parsing?
4) What is the behaviour on unparsing? I believe that MRM simply takes
the value presented to it and outputs it (it does not attempt to match it
against the pattern), so the DFDL equivalent would be to output the infoset
value.
5) For a repeating element, presumably we would consume only as much as
the number of occurrences dictates.
6) Should state explicitly that DFDL entity references are not allowed.
The XML character reference &#xNN; is used instead.
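
To make point 2 concrete, here is a minimal sketch of a length pattern that uses a \xNN escape to match a non-printable byte embedded in text data. It uses Python's re module purely as a stand-in: it is neither the XML Schema regular expression dialect nor the MRM implementation, and the function name pattern_length and the record layout are made up for illustration.

```python
import re

def pattern_length(pattern: bytes, data: bytes):
    """Return the number of bytes matched at the start of data, or None."""
    m = re.match(pattern, data)
    return m.end() if m else None

# Hypothetical record: upper-case letters terminated by the byte 0x1F,
# followed by further data that belongs to the next element.
record = b"ITEM\x1f0042"

# The raw pattern keeps \x1F as an escape for the regex engine to interpret.
length = pattern_length(rb"[A-Z]+\x1F", record)
no_match = pattern_length(rb"[A-Z]+\x1F", b"1234")
```

Without a hex escape (or an equivalent character reference) the 0x1F terminator could not be written in the pattern at all, which is the gap the MRM extension fills.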
Regards, Steve
Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848
"Mike Beckerle" <mbeckerle.dfdl at gmail.com>
Sent by: dfdl-wg-bounces at ogf.org
09/04/2008 01:34
Please respond to
mbeckerle.dfdl at gmail.com
To
Alan Powell/UK/IBM at IBMGB, <dfdl-wg at ogf.org>
cc
Subject
Re: [DFDL-WG] DFDL Regular Expression proposal
Suggest adding to 'lengthPattern' that the longest possible match is taken.
This is the usual behavior for regular expressions, but it's a
clarification I've seen other places.
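
A small sketch of why the clarification matters: in engines with ordered alternation, such as Python's re module used here only for illustration, the first alternative that succeeds wins, and that is not always the longest match. The helper names below are hypothetical, and the brute-force search is just one simple way to obtain the longest matching prefix, not a proposed implementation.

```python
import re

def first_match_length(pattern: str, data: str):
    """Length of whatever the engine matches first at the start of data."""
    m = re.match(pattern, data)
    return m.end() if m else None

def longest_match_length(pattern: str, data: str):
    """Length of the longest prefix of data that the whole pattern matches."""
    compiled = re.compile(pattern)
    # Try the longest candidate prefix first and shrink until one matches.
    for end in range(len(data), -1, -1):
        if compiled.fullmatch(data, 0, end):
            return end
    return None

data = "abc123"
first = first_match_length(r"ab|abc", data)
longest = longest_match_length(r"ab|abc", data)
```

Here the ordered-alternation result covers only "ab", while the longest-match rule covers "abc", so the two rules would give the element different lengths.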
From: dfdl-wg-bounces at ogf.org [mailto:dfdl-wg-bounces at ogf.org] On Behalf
Of Alan Powell
Sent: Thursday, April 03, 2008 12:44 PM
To: dfdl-wg at ogf.org
Subject: [DFDL-WG] DFDL Regular Expression proposal
Attached is the proposal for the regular expression syntax used to
determine element length.
Highlights
Based on the XML Schema regular expression subset used by WebSphere
Message Broker.
Only applies to representation = text
Uses a LengthPattern property rather than decorated syntax to distinguish
regular expressions from literals. As it is only used in one place, this
avoids having to escape the decoration character everywhere else, and we
are running out of decoration characters.
Assumes the pattern is converted to the data code page before matching
against the data stream.
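
As an illustration of the code page point, the sketch below matches a pattern against EBCDIC data. Note that it takes the opposite route from the proposal: instead of converting the pattern to the data code page, it decodes the data to Unicode, matches there, and re-encodes the matched text to obtain the length in bytes. For a single-byte code page the resulting length should be the same. The code page cp037, the function name and the sample data are assumptions made for the example.

```python
import re

def length_in_code_page(pattern: str, data: bytes, code_page: str):
    """Return the matched length in bytes for data held in code_page."""
    text = data.decode(code_page)
    m = re.match(pattern, text)
    if m is None:
        return None
    # Re-encode the matched text so the length is counted in data bytes.
    return len(m.group(0).encode(code_page))

# "ABC123" in EBCDIC (cp037); the bytes differ from their ASCII values.
ebcdic_data = "ABC123".encode("cp037")
byte_length = length_in_code_page(r"[A-Z]+", ebcdic_data, "cp037")
```

Matching the unconverted pattern directly against the raw bytes would fail here, because the EBCDIC bytes for "ABC" do not fall in the ASCII range that [A-Z] denotes.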
Comments and improvements as soon as possible please.
Alan Powell
MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM email: alan_powell at uk.ibm.com
Tel: +44 (0)1962 815073 Fax: +44 (0)1962 816898
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
--
dfdl-wg mailing list
dfdl-wg at ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg