[DFDL-WG] DFDL Regular Expression proposal

Alan Powell alan_powell at uk.ibm.com
Wed Apr 16 10:56:40 CDT 2008


Steve


1)

A good summary of full XML schema regular expressions is here 
http://www.xmlschemareference.com/regularExpression.html 

Quick summary of subset

XML Schema - Regular Expression - 

Meta Characters   - all supported

Normal Characters - no restrictions

Single Character Escape Sequence - all supported

Multiple Character Escape Sequences - 

Sequence  Description
--------  -----------
.         Any character except '\n' (newline) and '\r' (return).
\s        Whitespace, specifically ' ' (space), '\t' (tab), '\n' (newline)
          and '\r' (return).
\S        Any character except those matched by '\s'.
\i        The first character in an XML identifier. Specifically, any
          letter, the character '_', or the character ':'. See the XML
          Recommendation for the complex specification of a letter. This
          set is a subset of the characters that might appear in '\c'.
\I        Any character except those matched by '\i'.
\c        Any character that might appear in the built-in NMTOKEN
          datatype. See the XML Recommendation for the complex
          specification of a NameChar.
\C        Any character except those matched by '\c'.
\d        Any decimal digit. A shortcut for '\p{Nd}'.
\D        Any character except those matched by '\d'.
\w        Any character that might appear in a word. A shortcut for
          '[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]' (all characters except
          the set of "punctuation", "separator", and "other" characters).
\W        Any character except those matched by '\w'.
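
To make the '\d' and '\w' shortcuts concrete, here is a minimal Python sketch that tests a single character against the Unicode general categories quoted in the table. The function names matches_d and matches_w are made up for illustration, and Python's unicodedata module stands in for a real XML Schema regular expression engine, so treat it as an approximation only.

```python
import unicodedata


def matches_d(ch: str) -> bool:
    # '\d' is described as a shortcut for '\p{Nd}' (Number, Decimal Digit).
    return unicodedata.category(ch) == "Nd"


def matches_w(ch: str) -> bool:
    # '\w' is described as every character except those in the
    # punctuation (P*), separator (Z*) and other (C*) categories.
    return unicodedata.category(ch)[0] not in "PZC"
```

One consequence worth noticing: under this definition '_' is connector punctuation (Pc), so it is not matched by '\w', unlike in many Perl-style engines.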

Character Categories

Category  Description
--------  -----------
L         Letter, Any
Lu        Letter, Uppercase
Ll        Letter, Lowercase
Lt        Letter, Titlecase
Lm        Letter, Modifier
Lo        Letter, Other
L&        Letter, uppercase, lowercase, and titlecase letters (Lu, Ll, and
          Lt)
          Note: Optional in The Unicode Standard; not supported by the
          Schema Recommendation.
M         Mark, Any
Mn        Mark, Nonspacing
Mc        Mark, Spacing Combining
Me        Mark, Enclosing
N         Number, Any
Nd        Number, Decimal Digit
Nl        Number, Letter
No        Number, Other
P         Punctuation, Any
Pc        Punctuation, Connector
Pd        Punctuation, Dash
Ps        Punctuation, Open
Pe        Punctuation, Close
Pi        Punctuation, Initial quote (may behave like Ps or Pe, depending
          on usage)
Pf        Punctuation, Final quote (may behave like Ps or Pe, depending
          on usage)
Po        Punctuation, Other
S         Symbol, Any
Sm        Symbol, Math
Sc        Symbol, Currency
Sk        Symbol, Modifier
So        Symbol, Other
Z         Separator, Any
Zs        Separator, Space
Zl        Separator, Line
Zp        Separator, Paragraph
C         Other, Any
Cc        Other, Control
Cf        Other, Format
Cs        Other, Surrogate
          Note: Explicitly not supported by the Schema Recommendation.
Co        Other, Private Use
Cn        Other, Not Assigned (no characters in the file have this
          property).
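
As a rough illustration of how the one-letter and two-letter category names relate, the Python sketch below checks a character against a category name. The helper in_category is hypothetical, and it relies on Python's unicodedata tables rather than on any XML Schema processor, so the exact character assignments depend on the Unicode version shipped with the interpreter.

```python
import unicodedata


def in_category(ch: str, name: str) -> bool:
    # unicodedata.category() always returns a two-letter code such as "Lu".
    # A one-letter name ("L", "N", ...) therefore matches every subcategory,
    # while a two-letter name ("Lu", "Nd", ...) must match exactly.
    return unicodedata.category(ch).startswith(name)
```

For example, in_category("A", "L") and in_category("A", "Lu") both hold, while in_category("a", "Lu") does not.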
 


Character Blocks  - None supported (only really meaningful in Unicode)

XML Character References - supported


2) - 6) see below



Alan Powell

 MP 211, IBM UK Labs, Hursley,  Winchester, SO21 2JN, England
 Notes Id: Alan Powell/UK/IBM     email: alan_powell at uk.ibm.com 
 Tel: +44 (0)1962 815073                  Fax: +44 (0)1962 816898




From: Steve Hanson/UK/IBM
To: Alan Powell/UK/IBM
Cc: dfdl-wg at ogf.org, mbeckerle.dfdl at gmail.com
Date: 09/04/2008 13:58
Subject: Re: [DFDL-WG] DFDL Regular Expression proposal


Comments from Steve and Ian:

1) The subset proposed is basically lifted from the IBM MRM parser help. 
If I ever knew what the rationale for the subset was, I don't know it now. 
What features have we excluded? 

2) IBM MRM parser has extended the xsd regular expression syntax to allow 
hexadecimal characters using the following syntax:

\xNN    where each N is a hexadecimal digit in the range 0 to F

MRM makes much wider use of regular expressions, as an alternative to 
speculative parsing, so I can see why MRM needed this (one concrete use 
case was for TLOG retail messages). Do we need to support this in DFDL? 

Is this documented? I think we need to allow hex characters
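
For illustration only, one way to relate the MRM-style '\xNN' escape to the XML character reference form is a small rewriting step ahead of the regular expression engine. The Python sketch below is an assumption about how such a translation could look, not a description of what MRM or DFDL does; the function name hex_escapes_to_char_refs is hypothetical.

```python
import re

# Match any backslash escape; group 1 is set only for the \xNN form.
_ESCAPE = re.compile(r"\\(?:x([0-9A-Fa-f]{2})|(.))", re.DOTALL)


def hex_escapes_to_char_refs(pattern: str) -> str:
    def replace(match: re.Match) -> str:
        if match.group(1) is not None:
            # \xNN becomes the XML character reference &#xNN;
            return "&#x" + match.group(1).upper() + ";"
        # Every other escape (including an escaped backslash) is kept as-is.
        return match.group(0)

    return _ESCAPE.sub(replace, pattern)
```

Scanning escapes as a unit means an escaped backslash followed by a literal "x41" is left alone rather than being misread as a hex escape.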

3) If we don't add the hex support, what are the use cases for using a 
dfdl:lengthPattern versus using an xsd pattern facet?  It looks like 
pattern facets apply to all supported schema simple types, so not clear 
why dfdl:lengthPattern would be needed. The only use case I can think of 
is where we have length on a complex element or sequence or choice. If 
this is the only use case perhaps dfdl:lengthPattern should only be used 
in those cases?  MRM allows this use. (It might also answer 2 as it allows 
embedded binary data to appear).  Or is there a distinction between 
validation and parsing?

The xsd:pattern operates on the logical contents; lengthPattern operates 
on the physical contents including markup.

4) What is the behaviour on unparsing?  I believe that MRM simply takes 
the value presented to it and outputs it (it does not attempt to match it 
against the pattern), so the DFDL equivalent would be to output the infoset 
value.

Agree

5) For a repeating element, presumably we would consume only as much as 
the number of occurs dictates.

Good question. I had assumed lengthPattern had the same semantics as 
length.

6) Should state explicitly that DFDL entity references are not allowed. 
The XML character reference &#xNN; is used instead.

Need to support DFDL entities to allow x00

Regards, Steve

Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848




"Mike Beckerle" <mbeckerle.dfdl at gmail.com> 
Sent by: dfdl-wg-bounces at ogf.org
Date: 09/04/2008 01:34
Please respond to: mbeckerle.dfdl at gmail.com
To: Alan Powell/UK/IBM at IBMGB, <dfdl-wg at ogf.org>
Subject: Re: [DFDL-WG] DFDL Regular Expression proposal






Suggest adding to "lengthPattern" that the longest possible match is taken. 
This is the usual behavior for regular expressions, but it's a 
clarification I've seen in other places.
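
To show why the longest-match rule may be worth stating, here is a small Python sketch. Python's re module is a Perl-style engine with ordered alternation, not an XML Schema engine, so it is used here only to demonstrate the difference; the helper longest_match_length is hypothetical and deliberately naive.

```python
import re


def longest_match_length(pattern: str, data: str):
    # Try the longest prefix first, so the first prefix that the whole
    # pattern matches is by construction the longest possible match.
    for end in range(len(data), -1, -1):
        if re.fullmatch(pattern, data[:end]):
            return end
    return None  # no prefix of the data matches the pattern


# Ordered alternation stops at the first alternative that succeeds:
first_alternative = re.match("a|ab", "abc").end()
# Longest-match semantics would take "ab" instead:
longest = longest_match_length("a|ab", "abc")
```

With the pattern "a|ab" against "abc", the ordered engine reports a length of 1 while the longest-match helper reports 2, which is the kind of ambiguity the suggested wording would remove.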
 

From: dfdl-wg-bounces at ogf.org [mailto:dfdl-wg-bounces at ogf.org] On Behalf 
Of Alan Powell
Sent: Thursday, April 03, 2008 12:44 PM
To: dfdl-wg at ogf.org
Subject: [DFDL-WG] DFDL Regular Expression proposal
 

Attached is the proposal for the regular expression syntax used to 
determine element length. 

Highlights 
- Based on the XML Schema regular expression subset used by WebSphere 
  Message Broker. 
- Only applies to representation = text. 
- Uses LengthPattern property rather than decorated syntax to distinguish 
  regular expressions from literals. As the pattern is only used in one 
  place, this avoids having to escape the decoration character everywhere 
  else, and we are running out of decoration characters. 
- Assumes the pattern is converted to the data code page before matching 
  against the data stream.
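
As a rough sketch of the code page point, the Python example below finds the length of an element in bytes. Note that it takes the opposite route from the one assumed in the proposal: it decodes the data into characters and matches the pattern there, then re-encodes the match to count bytes, because Python's re module cannot match a pattern converted into an arbitrary code page. The function name element_length_in_bytes and the choice of code pages are assumptions made for illustration.

```python
import re


def element_length_in_bytes(pattern: str, raw: bytes, encoding: str):
    # Decode the physical data into characters using the data code page.
    text = raw.decode(encoding)
    match = re.match(pattern, text)
    if match is None:
        return None
    # Re-encode the matched text to get its length in the data stream.
    return len(match.group(0).encode(encoding))


# EBCDIC (cp037) data: the digits are single bytes 0xF1 0xF2 0xF3.
ebcdic_length = element_length_in_bytes(
    "[0-9]+", "123,ABC".encode("cp037"), "cp037"
)
```

For a multi-byte encoding such as UTF-8 the character count and the byte count can differ, which is why the match is re-encoded before measuring.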



 Comments and improvements as soon as possible please. 


Alan Powell

MP 211, IBM UK Labs, Hursley,  Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM     email: alan_powell at uk.ibm.com 
Tel: +44 (0)1962 815073                  Fax: +44 (0)1962 816898



 
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU 




--
  dfdl-wg mailing list
  dfdl-wg at ogf.org
  http://www.ogf.org/mailman/listinfo/dfdl-wg










