[DFDL-WG] Action 193: First draft for errata for RegEx

Mike Beckerle mbeckerle.dfdl at gmail.com
Mon Jan 28 08:49:07 EST 2013


To me this is excellent work, much appreciated. I'd like to be much more
directive about the non-portable constructs.

We should decide among only these choices:

1) the non-portable constructs are disallowed. It is a Schema Definition
Error (SDE) to use them. The check is required for all compliant DFDL
implementations (that implement regular expressions at all).

2) the non-portable constructs are allowed, but not recommended, and DFDL
implementations are *required* to issue non-portability warnings if these
constructs are used.

Not checking this, hoping for the best, user-beware, is a bad idea. A
scanner to find these syntaxes and disallow them is pretty easy to write.
Regular expression syntax is, by its very nature, not very rich: you handle
the escape scheme first, then scan everything else for appearances of the
offending constructs. Ironically, it's something that can be done with a
regular expression itself.
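
To make that concrete, here is a rough sketch of such a scanner. Python is
used purely for illustration; the function name find_nonportable and the
reason strings are my own invention, not anything from the spec. It
consumes escaped characters first so that something like \\X (an escaped
backslash followed by a literal X) is not flagged, then looks for the
constructs listed in the draft. It deliberately ignores character-class
context and \Q...\E quoting, so treat it as a starting point only.

```python
import re

# Alternatives are tried in order at each position. Offending escapes come
# first, then a catch-all for any other escaped character.
_TOKEN = re.compile(
    r"""
      (?P<named_char>\\N\{)      # \N{UNICODE CHARACTER NAME}
    | (?P<grapheme>\\X)          # \X
    | (?P<hex8>\\U)              # \Uhhhhhhhh
    | (?P<word_char>\\w)         # \w
    | (?P<escape>\\.)            # any other escaped character: skip it
    | (?P<comment>\(\?\#)        # (?# ... )
    | (?P<flag_group>\(\?(?P<on>[A-Za-z]*)(?:-(?P<off>[A-Za-z]*))?(?P<end>[:)]))
    """,
    re.VERBOSE | re.DOTALL,
)

# Hypothetical message texts, based on the Notes column of the draft table.
_REASONS = {
    "named_char": "named character (ICU only)",
    "grapheme": "grapheme cluster (ICU only)",
    "hex8": "8-digit hex escape (ICU only)",
    "word_char": "word character class differs between ICU and Java 7",
    "comment": "free-format comment (ICU only)",
}
_FLAG_REASONS = {
    "w": "UREGEX_UWORD flag (ICU only)",
    "d": "UNIX_LINES flag (Java 7 only)",
    "u": "UNICODE_CASE flag (Java 7 only)",
    "U": "UNICODE_CHARACTER_CLASS flag (Java 7 only)",
}


def find_nonportable(pattern):
    """Return a list of (offset, matched_text, reason) tuples."""
    findings = []
    for m in _TOKEN.finditer(pattern):
        if m.group("escape") is not None:
            continue  # harmless escape such as \d or \\
        if m.group("flag_group") is not None:
            flags = m.group("on") + (m.group("off") or "")
            for flag in sorted(set(flags) & set(_FLAG_REASONS)):
                findings.append((m.start(), m.group(), _FLAG_REASONS[flag]))
            # A plain (?:X) group has no flags and is left alone.
            if flags and m.group("end") == ":":
                findings.append(
                    (m.start(), m.group(), "flags applied to a group (per the draft)")
                )
            continue
        for name, reason in _REASONS.items():
            if m.group(name) is not None:
                findings.append((m.start(), m.group(), reason))
                break
    return findings
```

An implementation would call this on each dfdl:lengthPattern or
dfdl:testPattern value and then either raise an SDE when the list is
non-empty (choice 1) or issue one warning per entry (choice 2).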

...mike

On Mon, Jan 28, 2013 at 8:22 AM, Steve Hanson <smh at uk.ibm.com> wrote:

> Here's a draft errata for action 193, for review on the next WG call.
>
> Regards
>
> Steve Hanson
> Architect, Data Format Description Language (DFDL)
> Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> IBM SWG, Hursley, UK
> smh at uk.ibm.com
> tel:+44-1962-815848
>
> ============================================================
>
> *Section 24 to read as follows:*
>
> A DFDL regular expression may be specified for the dfdl:lengthPattern
> format property and for the dfdl:testPattern attribute of the dfdl:assert
> and dfdl:discriminator annotations. DFDL regular expressions do not
> interpret DFDL entities.
>
> A DFDL regular expression is defined by a set of valid pattern characters.
> For portability, it is recommended that the regular expression pattern be
> restricted to the subset common to ICU regular expressions [ICURE] and
> Java(R) 7 regular expressions [JAVARE] with the Unicode character classes
> flag (UNICODE_CHARACTER_CLASS) turned on. The following regular expression
> constructs are not common to both ICU and Java(R) 7 and are not
> recommended in a DFDL regular expression:
>
> *Construct*                 *Meaning*                                         *Notes*
>
> \N{UNICODE CHARACTER NAME}  Match the named character                         ICU only
>
> \X                          Match a Grapheme Cluster                          ICU only
>
> \Uhhhhhhhh                  Match the character with the hex value hhhhhhhh.  ICU only
>
> (?# ... )                   Free-format comment                               ICU only
>
> (?w-w)                      UREGEX_UWORD - Controls the behaviour of \b in    ICU only
>                             a pattern.
>
> (?d-d)                      UNIX_LINES - Enables Unix lines mode.             Java 7 only
>
> (?u-u)                      UNICODE_CASE - Enables Unicode-aware case         Java 7 only -
>                             folding.                                          always on for DFDL
>
> (?U-U)                      UNICODE_CHARACTER_CLASS - Enables the Unicode     Java 7 only -
>                             version of Predefined character classes and       always on for DFDL
>                             POSIX character classes.
>
> (?imsx-imsx:X)              X, as a non-capturing group with the given        Java 7 only
>                             flags. Note that the flags i,s,m,x are valid,
>                             but appending :X to the flag is not.
>
> Additionally, the behaviour of the word character construct (\w) is not
> consistent between ICU and Java(R) 7, so its use is not recommended.
> In Java(R) 7, \w is
> [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}],
> which is a larger set than in ICU, where \w is
> [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
> Character properties are detailed in Unicode Regular Expressions
> [UNICODERE].
>
>
> *Section 30 to add:*
> [ICURE]     - http://userguide.icu-project.org/strings/regexp
> [JAVARE]    - http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
> [UNICODERE] - http://www.unicode.org/reports/tr18/
>
> *Section 30 to remove:*
> [PERLRE] - http://perldoc.perl.org/perlre.html#Extended-Patterns
> [JAVARE] - http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
>
>
> Andy
>
> Andy Edwards - WebSphere Message Broker <http://www-01.ibm.com/software/integration/wbimessagebroker/> - DFDL <http://w3.ibm.com/bluepedia/display/en/Data+Format+Definition+Language>
> Email: andy.edwards at uk.ibm.com
> Snail Mail: MP211, Hursley Park, Hursley, WINCHESTER, Hants, SO21 2JN
> Tel int: 247222  Tel ext: +44 (0)1962 817222  Desk: DE2 U20



-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com