[DFDL-WG] Action 193: Second draft for errata for RegEx
Steve Hanson
smh at uk.ibm.com
Mon Feb 4 08:22:14 EST 2013
The errata for action 193 have been updated below; please review for the
next WG call.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 04/02/2013 12:50 -----
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: Steve Hanson/UK/IBM at IBMGB,
Cc: dfdl-wg at ogf.org
Date: 28/01/2013 13:49
Subject: Re: [DFDL-WG] Action 193: First draft for errata for RegEx
To me this is excellent work, much appreciated. I'd like to be much more
directive about the non-portable constructs.
We should decide among only these choices:
1) the non-portable constructs are disallowed. It is an SDE to use them.
The check is required for all compliant DFDL implementations (that
implement regular expressions at all.)
2) the non-portable constructs are allowed, but not recommended, and DFDL
implementations are *required* to issue non-portability warnings if these
constructs are used.
Not checking this (hoping for the best, user beware) is a bad idea. A
scanner to find these syntaxes and disallow them is pretty easy to write.
Regular expressions are, by their very nature, not very rich. You
implement the escape scheme, then scan everything else for appearances of
the offending constructs. Ironically, it's something that can be done with
a regular expression itself.
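
A minimal sketch of such a scanner follows, in Python, assuming the
construct list from the draft erratum quoted below. The names
find_non_portable, _FLAG_GROUP, _ICU_ONLY_ESCAPES and _NON_PORTABLE_FLAGS
are hypothetical, and the handling of \Q...\E quoting and of character
classes is a simplifying assumption rather than a full parser for either
engine's syntax.

```python
import re

# Inline flag group such as (?i), (?i-s) or (?i:X). A plain (?:X) group
# also matches, but with empty flag lists, so it is not reported.
_FLAG_GROUP = re.compile(r"\(\?([A-Za-z]*)(?:-([A-Za-z]*))?([:)])")

# Escapes that the draft lists as ICU only.
_ICU_ONLY_ESCAPES = {
    "N": r"\N{...} named character (ICU only)",
    "X": r"\X grapheme cluster (ICU only)",
    "U": r"\Uhhhhhhhh hex character (ICU only)",
}

# Flag letters that the draft lists as not common to both engines.
_NON_PORTABLE_FLAGS = {
    "w": "ICU only",
    "d": "Java 7 only",
    "u": "Java 7 only",
    "U": "Java 7 only",
}


def find_non_portable(pattern):
    # Returns a list of (offset, description), one per offending construct.
    findings = []
    i, n = 0, len(pattern)
    class_depth = 0  # > 0 while inside [...], where "(?" is literal
    while i < n:
        ch = pattern[i]
        if ch == "\\":
            nxt = pattern[i + 1] if i + 1 < n else ""
            if nxt == "Q":
                # \Q...\E quotes its content literally: skip past \E.
                end = pattern.find("\\E", i + 2)
                i = n if end == -1 else end + 2
                continue
            if nxt in _ICU_ONLY_ESCAPES:
                findings.append((i, _ICU_ONLY_ESCAPES[nxt]))
            i += 2  # the escaped character is consumed with the backslash
            continue
        if ch == "[":
            class_depth += 1
        elif ch == "]" and class_depth > 0:
            class_depth -= 1
        elif ch == "(" and class_depth == 0:
            if pattern.startswith("(?#", i):
                findings.append((i, "(?# ... ) comment (ICU only)"))
                # Skip the comment body so it is not scanned.
                end = pattern.find(")", i)
                i = n if end == -1 else end + 1
                continue
            m = _FLAG_GROUP.match(pattern, i)
            if m:
                on, off, term = m.group(1), m.group(2) or "", m.group(3)
                for flag in on + off:
                    if flag in _NON_PORTABLE_FLAGS:
                        where = _NON_PORTABLE_FLAGS[flag]
                        findings.append((i, f"inline flag '{flag}' ({where})"))
                if term == ":" and (on or off):
                    findings.append(
                        (i, "(?flags:X) scoped flag group (Java 7 only)"))
        i += 1
    return findings
```

An implementation taking choice 1 would raise an SDE when the returned
list is non-empty; one taking choice 2 would turn each entry into a
non-portability warning.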
...mike
On Mon, Jan 28, 2013 at 8:22 AM, Steve Hanson <smh at uk.ibm.com> wrote:
Here's a draft errata for action 193, for review on the next WG call.
Regards
Steve Hanson
============================================================
Section 24 to read as follows:
A DFDL regular expression may be specified for the dfdl:lengthPattern
format property and the dfdl:testPattern attribute of the dfdl:assert and
dfdl:discriminator annotations. DFDL regular expressions do not interpret
DFDL entities.
A DFDL regular expression is defined by a set of valid pattern characters.
For portability, a DFDL regular expression pattern is restricted to the
common subset of ICU regular expressions [ICURE] and Java(R) 7 regular
expressions [JAVARE] with the Unicode flags UNICODE_CASE and
UNICODE_CHARACTER_CLASS turned on. The following regular expression
constructs are not common to both ICU and Java(R) 7, and it is a schema
definition error if any are used in a DFDL regular expression:
*Construct*                  *Meaning*                                       *Notes*
\N{UNICODE CHARACTER NAME}   Match the named character                       ICU only
\X                           Match a Grapheme Cluster                        ICU only
\Uhhhhhhhh                   Match the character with the hex value          ICU only
                             hhhhhhhh.
(?# ... )                    Free-format comment                             ICU only
(?w-w)                       UREGEX_UWORD - Controls the behaviour of \b     ICU only
                             in a pattern.
(?d-d)                       UNIX_LINES - Enables Unix lines mode.           Java 7 only
(?u-u)                       UNICODE_CASE - Enables Unicode-aware case       Java 7 only (1)
                             folding.
(?U-U)                       UNICODE_CHARACTER_CLASS - Enables the Unicode   Java 7 only (2)
                             version of Predefined character classes and
                             POSIX character classes.
(?imsx-imsx:X)               X, as a non-capturing group with the given      Java 7 only
                             flags. Note that the flags i,s,m,x are valid,
                             but appending :X to the flag is not.
Notes:
(1) Implementations using Java 7 must set the UNICODE_CASE flag by default
to match ICU.
(2) Implementations using Java 7 must set the UNICODE_CHARACTER_CLASS flag
by default to match ICU.
Additionally, the behaviour of the word character construct (\w) is not
consistent between ICU and Java 7. In Java 7, \w is
[\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}],
which is a larger set than in ICU, where \w is
[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
The use of \w is not recommended in DFDL regular expressions in
conjunction with Unicode encodings, and an implementation must issue a
warning if such usage is detected.
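
As an illustration of the warning requirement only, a check for \w might
be sketched as follows in Python. The function names, the warning text and
the set of encoding names treated as Unicode encodings are all assumptions
made for this sketch; none of them is defined by the DFDL specification.

```python
import warnings

# Assumption: encoding names treated as Unicode encodings by this check.
UNICODE_ENCODINGS = {
    "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE",
    "UTF-32", "UTF-32BE", "UTF-32LE",
}


def uses_word_class(pattern):
    # True if the pattern contains \w outside \Q...\E quoting. An escaped
    # backslash followed by a literal 'w' is not counted.
    i, n = 0, len(pattern)
    while i < n:
        if pattern[i] == "\\":
            if pattern.startswith("\\Q", i):
                # Quoted section: skip to just past the closing \E.
                end = pattern.find("\\E", i + 2)
                if end == -1:
                    return False
                i = end + 2
                continue
            if pattern.startswith("\\w", i):
                return True
            i += 2  # skip the escaped character
        else:
            i += 1
    return False


def check_word_class(pattern, encoding):
    # Issues a warning and returns True when \w is used together with a
    # Unicode encoding; otherwise returns False.
    if encoding.upper() in UNICODE_ENCODINGS and uses_word_class(pattern):
        warnings.warn(
            "\\w matches different sets in ICU and Java 7; "
            "its use with Unicode encodings is not recommended",
            UserWarning,
        )
        return True
    return False
```

For example, check_word_class(r"[\w-]+", "UTF-8") would warn, while the
same pattern with a non-Unicode encoding such as ISO-8859-1 would not.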
Character properties are detailed in Unicode Regular Expressions
[UNICODERE].
Section 30 to add:
[ICURE] - http://userguide.icu-project.org/strings/regexp
[JAVARE] -
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
[UNICODERE] - http://www.unicode.org/reports/tr18/
Section 30 to remove:
[PERLRE] - http://perldoc.perl.org/perlre.html#Extended-Patterns
[JAVARE] -
http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
--
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com