[DFDL-WG] regex free-spacing mode

Steve Hanson smh at uk.ibm.com
Mon Jul 8 06:31:06 EDT 2013


Andy, did you test both ICU4J and ICU4C?

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848



From:   Andrew Edwards/UK/IBM
To:     Steve Hanson/UK/IBM at IBMGB, 
Cc:     dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org, Mike Beckerle 
<mbeckerle.dfdl at gmail.com>
Date:   27/06/2013 17:18
Subject:        Re: [DFDL-WG] regex free-spacing mode


Hi Mike (et al),

I've gone back to the ICU doc for this and run a few tests locally.  It 
looks like both forms of the flag-setting construct, with and without a 
non-capturing group, can now be used in Java 7 and ICU 51.1.  In other 
words, both of the following constructs are supported:

        (?imsx-imsx)
        (?imsx-imsx:X)

So the quick answer is that what you are trying to do in your example 
below is supported.
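
As a rough illustration of the two constructs on the Java side only (it 
says nothing about ICU), here is a minimal sketch using java.util.regex. 
The class name, method names and sample strings are made up for the 
example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample strings are illustrative only.
class FlagGroupDemo {

    // (?i) without ":X" turns case-insensitivity on for the rest of the pattern.
    static boolean inlineFlag(String input) {
        return Pattern.compile("(?i)abc").matcher(input).matches();
    }

    // (?i:X) applies the flag only inside the non-capturing group X.
    static boolean scopedFlag(String input) {
        return Pattern.compile("(?i:ab)c").matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(inlineFlag("ABC")); // true: whole pattern ignores case
        System.out.println(scopedFlag("ABc")); // true: only "ab" ignores case
        System.out.println(scopedFlag("ABC")); // false: trailing "c" is case-sensitive
    }
}
```

The scoped form is handy when only part of a pattern should ignore case; 
the same idea applies to the x, s and m flags.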

The long answer is that erratum 3.29 can probably be updated by removing 
the restriction on (?imsx-imsx:X), as below.


3.29  Sections 24 and 30. The DFDL specification is not prescriptive 
enough when specifying what is allowed for regular expressions used in the 
length property and testPattern property.

Section 24 is replaced by the following.

"A DFDL regular expression may be specified for the dfdl:lengthPattern 
format property and the dfdl:testPattern attribute of the dfdl:assert and 
dfdl:discriminator annotations. DFDL regular expressions do not interpret 
DFDL entities.

A DFDL regular expression is defined by a set of valid pattern characters. 
For portability, a DFDL regular expression pattern is restricted to the 
inclusive subset of the ICU regular expression [ICURE] and the Java(R) 7 
regular expression [JAVARE] with the Unicode flags UNICODE_CASE and 
UNICODE_CHARACTER_CLASS turned on.  The following regular expression 
constructs are not common to both ICU and Java(R) 7 and it is a schema 
definition error if any are used in a DFDL regular expression:

*Construct*                 *Meaning*                           *Notes*

\N{UNICODE CHARACTER NAME}  Match the named character           ICU only

\X                          Match a Grapheme Cluster            ICU only

\Uhhhhhhhh                  Match the character with the hex    ICU only
                            value hhhhhhhh.

(?# ... )                   Free-format comment                 ICU only

(?w-w)                      UREGEX_UWORD - Controls the         ICU only
                            behaviour of \b in a pattern.

(?d-d)                      UNIX_LINES - Enables Unix lines     Java 7 only
                            mode.

(?u-u)                      UNICODE_CASE - Enables              Java 7 only (1)
                            Unicode-aware case folding.

(?U-U)                      UNICODE_CHARACTER_CLASS - Enables   Java 7 only (2)
                            the Unicode version of Predefined
                            character classes and POSIX
                            character classes.

(?imsx-imsx:X)              X, as a non-capturing group with    Java 7 only
                            the given flags. Note that the
                            flags i,s,m,x are valid, but
                            appending :X to the flag is not.


Notes:
(1) Implementations using Java 7 must set flag UNICODE_CASE by default to 
match ICU.
(2) Implementations using Java 7 must set flag UNICODE_CHARACTER_CLASS by 
default to match ICU.

Additionally, the behaviour of the word character construct (\w) is not 
consistent in ICU and Java 7. In Java 7 \w is 
[\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}], which is a larger 
set than ICU where \w is [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].  The use of \w 
is not recommended in DFDL regular expressions in conjunction with Unicode 
encodings, and an implementation must issue a warning if such usage is 
detected.

Character properties are detailed by the Unicode Regular Expressions 
[UNICODERE]."
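
To make the \w discussion above concrete on the Java side, the following 
sketch compares \w with and without the two flags that the notes require 
a Java 7 implementation to set. It does not exercise ICU at all; the 
class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it only shows java.util.regex behaviour.
class WordCharDemo {

    // The two flags the erratum text says must be on by default under Java 7.
    static final int DFDL_FLAGS =
            Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS;

    // Plain Java \w, which is ASCII-only: [a-zA-Z_0-9].
    static boolean isWordDefault(String s) {
        return Pattern.compile("\\w").matcher(s).matches();
    }

    // \w with the Unicode flags set, i.e. the larger Java set quoted above.
    static boolean isWordUnicode(String s) {
        return Pattern.compile("\\w", DFDL_FLAGS).matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";         // LATIN SMALL LETTER E WITH ACUTE
        String combiningAcute = "\u0301"; // COMBINING ACUTE ACCENT, gc=Mn

        System.out.println(isWordDefault(eAcute));         // false
        System.out.println(isWordUnicode(eAcute));         // true
        System.out.println(isWordUnicode(combiningAcute)); // true, via \p{gc=Mn}
    }
}
```

The combining accent is the kind of character that the text above puts in 
the Java set but not in the ICU set, which is why a warning on \w is 
asked for.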

Section 30 is updated to correct the references used in section 24:

-Add:[ICURE] - http://userguide.icu-project.org/strings/regexp
-Add:[UNICODERE] - http://www.unicode.org/reports/tr18/
-Remove:[PERLRE] - http://perldoc.perl.org/perlre.html#Extended-Patterns
-Change:[JAVARE] - 
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html 
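
The erratum text makes it a schema definition error to use any of the 
listed constructs. One possible way an implementation might detect them 
is sketched below. This is a deliberately naive scanner and an assumption 
on my part, not how any DFDL processor is known to do it: it ignores 
character classes, \Q...\E quoting and free-spacing comments, and it 
leaves (?imsx-imsx:X) alone, in line with the proposed update. All names 
are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of any DFDL implementation.
class PortabilityCheck {

    // Returns a label for each non-portable construct found in the pattern.
    static List<String> nonPortable(String regex) {
        List<String> found = new ArrayList<String>();
        int i = 0;
        while (i < regex.length()) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                char next = regex.charAt(i + 1);
                if (next == 'X' || next == 'U') {
                    found.add("\\" + next);          // \X or \Uhhhhhhhh
                } else if (next == 'N' && regex.startsWith("{", i + 2)) {
                    found.add("\\N{...}");           // named character
                }
                i += 2; // skip the escaped char so a literal "\\" + "X" is not misread
                continue;
            }
            if (regex.startsWith("(?#", i)) {
                found.add("(?#...)");                // free-format comment
            } else if (regex.startsWith("(?", i)) {
                int j = i + 2;
                StringBuilder flags = new StringBuilder();
                while (j < regex.length()
                        && (Character.isLetter(regex.charAt(j)) || regex.charAt(j) == '-')) {
                    flags.append(regex.charAt(j));
                    j++;
                }
                // Only treat it as a flag group if it ends in ')' or ':'.
                boolean isFlagGroup = j < regex.length()
                        && (regex.charAt(j) == ')' || regex.charAt(j) == ':');
                if (isFlagGroup) {
                    for (char f : flags.toString().toCharArray()) {
                        if (f == 'w' || f == 'd' || f == 'u' || f == 'U') {
                            found.add("(?" + f + ")");
                        }
                    }
                }
            }
            i++;
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(nonPortable("(?x)(?i:abc)"));     // nothing flagged
        System.out.println(nonPortable("(?u)abc(?# note)")); // two constructs flagged
    }
}
```

A real check would more likely hook into the regex parser itself; the 
string scan is only meant to show which constructs the table rules out.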



Cheers,
Andy 
Andy Edwards - IBM Integration Bus - DFDL








From:   Steve Hanson/UK/IBM
To:     Mike Beckerle <mbeckerle.dfdl at gmail.com>, 
Cc:     dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org, Andrew Edwards/UK/IBM 
at IBMGB
Date:   27/06/2013 10:26
Subject:        Re: [DFDL-WG] regex free-spacing mode





Mike, I believe that is the case but I have copied Andy Edwards who is the 
person in the IBM DFDL team who added our regex support.

Regards

Steve Hanson



From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     dfdl-wg at ogf.org, 
Date:   26/06/2013 18:56
Subject:        Re: [DFDL-WG] regex free-spacing mode
Sent by:        dfdl-wg-bounces at ogf.org



To clarify, errata v13 has this in the table for erratum 3.29 in the list 
of non-portables:

(?imsx-imsx:X)

X, as a non-capturing group with the 
given flags. Note that the flags i,s,m,x 
are valid, but appending :X to the flag is 
not.

Java 7 only 

I interpret this as meaning that only the so-called modifier-span notation 
(the :X suffix) is disallowed and that plain (?x) is still allowed, but I 
wanted to be sure that was the correct interpretation.


On Wed, Jun 26, 2013 at 1:13 PM, Mike Beckerle <mbeckerle.dfdl at gmail.com> 
wrote:

I wrote this complicated regex today and it works in Daffodil. 

The question is this: is (?x), which turns on regex free-spacing mode, 
officially supported in DFDL?

You can see from below that it is VERY desirable that it works..... 

  <xs:simpleType name="frontMatterType">
      <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:simpleType lengthKind="pattern" terminator="%FF;">

            <dfdl:property name="lengthPattern"><![CDATA[(?x) # regex free spacing mode
            #
            # match the front matter of the document
            #
            .{1,8192}?                # up to 8K of front matter content
            #
            # front matter ends at the first message description page
            #
            (?=                       # lookahead (followed by but not including...)
              \f                      # a formfeed character
              (?> \s | \x08 ){1,100}? # whitespace or backspace (x08)
              MESSAGE\ DESCRIPTION\r  # this literal text
              \s{1,100}?              # up to 100 whitespaces
              -{19}\r                 # exactly 19 hyphens and a CR
            )                         # end lookahead 
            ]]></dfdl:property>

           </dfdl:simpleType>
        </xs:appinfo>
      </xs:annotation>
      <xs:restriction base="xs:string" />
    </xs:simpleType>
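
For anyone who wants to try the pattern outside a DFDL processor, here is 
a small Java sketch that compiles the same free-spacing regex with 
java.util.regex and measures the front matter at the start of a string. 
The class name, the helper methods and the sample data are all invented 
for the illustration, and lookingAt() is only my approximation of how a 
lengthPattern is anchored at the current position.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the sample data is invented for illustration.
class FrontMatterDemo {

    // The lengthPattern from the schema, one regex line per Java string line.
    // In free-spacing mode each comment must end with a newline.
    static final String FRONT_MATTER =
          "(?x)                        # regex free spacing mode\n"
        + ".{1,8192}?                  # up to 8K of front matter content\n"
        + "(?=                         # lookahead\n"
        + "  \\f                       # a formfeed character\n"
        + "  (?> \\s | \\x08 ){1,100}? # whitespace or backspace (x08)\n"
        + "  MESSAGE\\ DESCRIPTION\\r  # this literal text\n"
        + "  \\s{1,100}?               # up to 100 whitespaces\n"
        + "  -{19}\\r                  # exactly 19 hyphens and a CR\n"
        + ")                           # end lookahead\n";

    // Length of the front matter at the start of data, or -1 if no match.
    static int frontMatterLength(String data) {
        Matcher m = Pattern.compile(FRONT_MATTER).matcher(data);
        return m.lookingAt() ? m.end() : -1;
    }

    // Builds made-up input with the given number of hyphens in the rule line.
    static String sample(int hyphenCount) {
        StringBuilder sb = new StringBuilder("FRONT MATTER");
        sb.append("\f \b").append("MESSAGE DESCRIPTION\r").append("  ");
        for (int i = 0; i < hyphenCount; i++) {
            sb.append('-');
        }
        return sb.append("\rpage body").toString();
    }

    public static void main(String[] args) {
        System.out.println(frontMatterLength(sample(19))); // length of "FRONT MATTER"
        System.out.println(frontMatterLength(sample(18))); // -1, rule line too short
    }
}
```

Note that without the s flag the dot does not match line terminators in 
Java, so the made-up front matter here is kept on a single line.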

-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com



