[DFDL-WG] Action 193: Second draft for errata for RegEx

Steve Hanson smh at uk.ibm.com
Mon Feb 4 08:22:14 EST 2013


The errata for action 193 has been updated below; please review for the 
next WG call.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 04/02/2013 12:50 -----




From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     Steve Hanson/UK/IBM at IBMGB, 
Cc:     dfdl-wg at ogf.org
Date:   28/01/2013 13:49
Subject:        Re: [DFDL-WG] Action 193: First draft for errata for RegEx




To me this is excellent work, much appreciated. I'd like to be much more 
directive about the non-portable constructs. 

We should decide among only these choices:

1) the non-portable constructs are disallowed. It is an SDE to use them. 
The check is required for all compliant DFDL implementations (that 
implement regular expressions at all.)

2) the non-portable constructs are allowed, but not recommended, and DFDL 
implementations are *required* to issue non-portability warnings if these 
constructs are used.

Not checking this, hoping for the best, user-beware, is a bad idea. A 
scanner to find these syntaxes and disallow them is pretty easy to write. 
Regular expressions are, by their very nature, not very rich. You 
implement the escape scheme, then scan everything else for appearances of 
the offending constructs. Ironically, it's something that can be done with 
a regular expression itself. 
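
As a rough illustration of the scanner idea, here is a minimal Python 
sketch. It assumes the construct list in the quoted draft below; the names 
find_non_portable, NON_PORTABLE_ESCAPES, and NON_PORTABLE_FLAGS are 
hypothetical and not taken from any DFDL implementation. It skips 
backslash escapes and \Q...\E quoting first, then looks for the offending 
constructs in whatever is left. It does not treat character classes 
specially, so a pattern such as [(?#] would be flagged incorrectly.

```python
import re

# Escapes the draft lists as ICU only (keyed by the character after the backslash).
NON_PORTABLE_ESCAPES = {
    "N": r"\N{...} named character, ICU only",
    "X": r"\X grapheme cluster, ICU only",
    "U": r"\Uhhhhhhhh hex character, ICU only",
}

# Inline flag letters the draft lists as non-portable.
NON_PORTABLE_FLAGS = {
    "w": "(?w-w) UREGEX_UWORD, ICU only",
    "d": "(?d-d) UNIX_LINES, Java 7 only",
    "u": "(?u-u) UNICODE_CASE, Java 7 only",
    "U": "(?U-U) UNICODE_CHARACTER_CLASS, Java 7 only",
}

# Text after "(?" in a flag group: on-flags, optional "-off-flags", then ")" or ":".
_FLAG_GROUP = re.compile(r"([A-Za-z]*)(?:-([A-Za-z]*))?([:)])")


def find_non_portable(pattern):
    """Return (offset, description) pairs for constructs the draft disallows."""
    findings = []
    i, n = 0, len(pattern)
    while i < n:
        if pattern[i] == "\\":
            nxt = pattern[i + 1] if i + 1 < n else ""
            if nxt == "Q":
                # \Q...\E quotes its contents literally; skip past the closing \E.
                end = pattern.find("\\E", i + 2)
                i = n if end < 0 else end + 2
                continue
            if nxt in NON_PORTABLE_ESCAPES:
                findings.append((i, NON_PORTABLE_ESCAPES[nxt]))
            i += 2  # consume the escaped character so it is never rescanned
            continue
        if pattern.startswith("(?", i):
            if pattern.startswith("(?#", i):
                findings.append((i, "(?# ... ) comment, ICU only"))
            else:
                m = _FLAG_GROUP.match(pattern, i + 2)
                if m:
                    flags = m.group(1) + (m.group(2) or "")
                    for flag in flags:
                        if flag in NON_PORTABLE_FLAGS:
                            findings.append((i, NON_PORTABLE_FLAGS[flag]))
                    if flags and m.group(3) == ":":
                        findings.append((i, "(?flags:X) scoped flags, Java 7 only"))
            i += 2
            continue
        i += 1
    return findings
```

Under choice 1 an implementation would raise an SDE when the returned list 
is non-empty; under choice 2 it would turn each entry into a 
non-portability warning.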

...mike

On Mon, Jan 28, 2013 at 8:22 AM, Steve Hanson <smh at uk.ibm.com> wrote:
Here's a draft errata for action 193, for review on the next WG call. 

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 

============================================================ 

Section 24 to read as follows: 

A DFDL regular expression may be specified for the dfdl:lengthPattern 
format property and the dfdl:testPattern attribute of the dfdl:assert and 
dfdl:discriminator annotations. DFDL regular expressions do not interpret 
DFDL entities. 

A DFDL regular expression is defined by a set of valid pattern characters. 
For portability, a DFDL regular expression pattern is restricted to the 
inclusive subset of the ICU regular expression [ICURE] and the Java(R) 7 
regular expression [JAVARE] with the Unicode flags UNICODE_CASE and 
UNICODE_CHARACTER_CLASS turned on. The following regular expression 
constructs are not common to both ICU and Java(R) 7 and it is a schema 
definition error if any are used in a DFDL regular expression: 
  
*Construct*                 *Meaning*                                         *Notes*
\N{UNICODE CHARACTER NAME}  Match the named character                         ICU only
\X                          Match a Grapheme Cluster                          ICU only
\Uhhhhhhhh                  Match the character with the hex value hhhhhhhh.  ICU only
(?# ... )                   Free-format comment                               ICU only
(?w-w)                      UREGEX_UWORD - Controls the behaviour of \b in    ICU only
                            a pattern.
(?d-d)                      UNIX_LINES - Enables Unix lines mode.             Java 7 only
(?u-u)                      UNICODE_CASE - Enables Unicode-aware case         Java 7 only (1)
                            folding.
(?U-U)                      UNICODE_CHARACTER_CLASS - Enables the Unicode     Java 7 only (2)
                            version of Predefined character classes and
                            POSIX character classes.
(?imsx-imsx:X)              X, as a non-capturing group with the given        Java 7 only
                            flags. Note that the flags i,s,m,x are valid,
                            but appending :X to the flag is not.

Notes:
(1) Implementations using Java 7 must set flag UNICODE_CASE by default to 
match ICU.
(2) Implementations using Java 7 must set flag UNICODE_CHARACTER_CLASS by 
default to match ICU.

Additionally, the behaviour of the word character construct (\w) is not 
consistent between ICU and Java 7. In Java 7 \w is 
[\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}], 
which is a larger set than ICU where \w is 
[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]. 
The use of \w is not recommended in DFDL regular expressions in 
conjunction with Unicode encodings, and an implementation must issue a 
warning if such usage is detected. 
Character properties are detailed in Unicode Regular Expressions 
[UNICODERE]. 
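
One way the \w warning check might look is sketched below in Python. The 
function name word_class_warning and the set of encoding names treated as 
Unicode are assumptions made for illustration; the draft does not say 
which encodings count. Escapes are consumed in pairs so that \\w (a 
literal backslash followed by w) is not reported; \Q...\E quoting is not 
handled.

```python
import re

# Assumption: the encoding names treated as "Unicode encodings" for this check.
UNICODE_ENCODINGS = {
    "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE",
    "UTF-32", "UTF-32BE", "UTF-32LE",
}


def word_class_warning(pattern, encoding):
    """Return a warning string if \\w is used with a Unicode encoding, else None."""
    if encoding.upper() not in UNICODE_ENCODINGS:
        return None
    # Each match consumes a backslash plus the character it escapes, so an
    # escaped backslash can never be mistaken for the start of \w.
    for m in re.finditer(r"\\(.)", pattern, re.DOTALL):
        if m.group(1) == "w":
            return ("\\w at offset %d matches different character sets in "
                    "ICU and Java 7" % m.start())
    return None
```

For example, word_class_warning(r"\w+", "UTF-8") returns a warning 
string, while the same pattern with "US-ASCII" returns None.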


Section 30 to add: 
[ICURE]     - http://userguide.icu-project.org/strings/regexp 
[JAVARE]    - http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html 
[UNICODERE] - http://www.unicode.org/reports/tr18/ 

Section 30 to remove: 
[PERLRE] - http://perldoc.perl.org/perlre.html#Extended-Patterns 
[JAVARE] - http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html 



Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

--
  dfdl-wg mailing list
  dfdl-wg at ogf.org
  https://www.ogf.org/mailman/listinfo/dfdl-wg



-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com


