[DFDL-WG] Action 193: First draft for errata for RegEx

Steve Hanson smh at uk.ibm.com
Mon Jan 28 08:22:36 EST 2013


Here's a draft errata for action 193, for review on the next WG call.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848

============================================================

Section 24 to read as follows:

A DFDL regular expression may be specified for the dfdl:lengthPattern
format property and the dfdl:testPattern attribute of the dfdl:assert and
dfdl:discriminator annotations.  DFDL regular expressions do not interpret
DFDL entities.

A DFDL regular expression is defined by a set of valid pattern characters.
For portability, it is recommended that the regular expression pattern is
restricted to the subset common to the ICU regular expression [ICURE] and
the Java(R) 7 regular expression [JAVARE] with the Unicode character classes
flag (UNICODE_CHARACTER_CLASS) turned on.  The following regular expression
constructs are not common to both ICU and Java(R) 7 and are not recommended
in a DFDL regular expression:
 
*Construct*                 *Meaning*                                         *Notes*

\N{UNICODE CHARACTER NAME}  Match the named character                         ICU only

\X                          Match a Grapheme Cluster                          ICU only

\Uhhhhhhhh                  Match the character with the hex value hhhhhhhh   ICU only

(?# ... )                   Free-format comment                               ICU only

(?w-w)                      UREGEX_UWORD - Controls the behaviour of \b in    ICU only
                            a pattern.

(?d-d)                      UNIX_LINES - Enables Unix lines mode.             Java 7 only

(?u-u)                      UNICODE_CASE - Enables Unicode-aware case         Java 7 only -
                            folding.                                          always on for DFDL

(?U-U)                      UNICODE_CHARACTER_CLASS - Enables the Unicode     Java 7 only -
                            version of Predefined character classes and       always on for DFDL
                            POSIX character classes.

(?imsx-imsx:X)              X, as a non-capturing group with the given        Java 7 only
                            flags. Note that the flags i, s, m, x are valid,
                            but appending :X to the flags is not.
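
As an illustration only, the constructs in the table above could be flagged
by a small scanner before a pattern is accepted.  The following Java sketch
is not part of the proposed errata text.  The class name PortableRegexLint
and the method findNonPortable are hypothetical, and the scanner makes no
attempt to understand character classes ([...]) or \Q...\E quoting, so it
may report false positives inside them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of DFDL, ICU, or the JDK.
class PortableRegexLint {

    /** Returns one message per non-recommended construct found in the pattern. */
    static List<String> findNonPortable(String pattern) {
        List<String> findings = new ArrayList<String>();
        int n = pattern.length();
        for (int i = 0; i < n; i++) {
            char c = pattern.charAt(i);
            if (c == '\\' && i + 1 < n) {
                char next = pattern.charAt(i + 1);
                if (next == 'N' && i + 2 < n && pattern.charAt(i + 2) == '{') {
                    findings.add("\\N{...} named character (ICU only)");
                } else if (next == 'X') {
                    findings.add("\\X grapheme cluster (ICU only)");
                } else if (next == 'U') {
                    findings.add("\\Uhhhhhhhh hex character (ICU only)");
                }
                i++; // skip the escaped character so an escaped backslash is not misread
            } else if (c == '(' && i + 1 < n && pattern.charAt(i + 1) == '?') {
                if (i + 2 < n && pattern.charAt(i + 2) == '#') {
                    findings.add("(?# ... ) comment (ICU only)");
                    continue;
                }
                // Collect inline flag letters such as "i", "d-x" or "imsx".
                int j = i + 2;
                while (j < n && (Character.isLetter(pattern.charAt(j)) || pattern.charAt(j) == '-')) {
                    j++;
                }
                String flags = pattern.substring(i + 2, j);
                boolean closes = j < n && (pattern.charAt(j) == ')' || pattern.charAt(j) == ':');
                if (flags.isEmpty() || !closes) {
                    continue; // plain (?:...), lookarounds, named groups, and so on
                }
                if (flags.indexOf('w') >= 0) findings.add("(?w) UREGEX_UWORD flag (ICU only)");
                if (flags.indexOf('d') >= 0) findings.add("(?d) UNIX_LINES flag (Java 7 only)");
                if (flags.indexOf('u') >= 0) findings.add("(?u) UNICODE_CASE flag (Java 7 only)");
                if (flags.indexOf('U') >= 0) findings.add("(?U) UNICODE_CHARACTER_CLASS flag (Java 7 only)");
                if (pattern.charAt(j) == ':') {
                    findings.add("(?flags:X) flag-scoped group (Java 7 only, per the table)");
                }
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        // Example pattern chosen for illustration; it mixes three listed constructs.
        for (String message : findNonPortable("(?d)\\X+(?i:abc)")) {
            System.out.println(message);
        }
    }
}
```

A pattern such as [a-z]+\d{2} or (?i)abc yields an empty list, whereas each
construct from the table adds one entry.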

Additionally, the behaviour of the word character construct (\w) is not
consistent between ICU and Java(R) 7, so its use is not recommended.  In
Java(R) 7, \w is
[\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}],
which is a larger set than in ICU, where \w is
[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
Character properties are described in Unicode Regular Expressions
[UNICODERE].
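
The difference can be observed from the Java side.  The sketch below is
illustrative only: it compiles \w with the UNICODE_CHARACTER_CLASS and
UNICODE_CASE flags, and compares it with the ICU set quoted above, which is
emulated here as an ordinary Java character class rather than by calling
ICU.  The class and method names (WordCharDemo, javaWordChar,
icuStyleWordChar) are hypothetical, and real ICU behaviour may vary between
versions.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: not part of DFDL, ICU, or the JDK.
class WordCharDemo {

    // Flags the draft describes as always on for DFDL when Java is the engine.
    static final int DFDL_FLAGS = Pattern.UNICODE_CHARACTER_CLASS | Pattern.UNICODE_CASE;

    // \w as Java interprets it with the Unicode character classes flag on.
    static final Pattern JAVA_WORD = Pattern.compile("\\w", DFDL_FLAGS);

    // Assumption: the ICU definition of \w quoted in the draft, written out explicitly.
    static final Pattern ICU_STYLE_WORD = Pattern.compile("[\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}]");

    static boolean javaWordChar(String s) {
        return JAVA_WORD.matcher(s).matches();
    }

    static boolean icuStyleWordChar(String s) {
        return ICU_STYLE_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Letter, digit, connector punctuation (gc=Pc), combining acute accent (gc=Mn).
        String[] samples = { "a", "5", "_", "\u0301" };
        for (String s : samples) {
            System.out.printf("U+%04X java=%b icuStyle=%b%n",
                    (int) s.charAt(0), javaWordChar(s), icuStyleWordChar(s));
        }
    }
}
```

A connector such as "_" (gc=Pc) or a combining mark such as U+0301 (gc=Mn)
is accepted by the first pattern and rejected by the second, whereas "a" and
"5" are accepted by both.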


Section 30 to add:
[ICURE]     - http://userguide.icu-project.org/strings/regexp
[JAVARE]    - http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
[UNICODERE] - http://www.unicode.org/reports/tr18/

Section 30 to remove:
[PERLRE] - http://perldoc.perl.org/perlre.html#Extended-Patterns
[JAVARE] - http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html


Andy 
Andy Edwards - WebSphere Message Broker - DFDL


Email: andy.edwards at uk.ibm.com
Snail Mail: MP211, Hursley park, Hursley, WINCHESTER, Hants, SO21 2JN
Tel int: 247222
Tel ext: +44 (0)1962 817222
Desk: DE2 U20
