[DFDL-WG] regex free-spacing mode
Steve Hanson
smh at uk.ibm.com
Mon Jul 8 06:31:06 EDT 2013
Andy, did you test both ICU4J and ICU4C?
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
From: Andrew Edwards/UK/IBM
To: Steve Hanson/UK/IBM at IBMGB,
Cc: dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org, Mike Beckerle
<mbeckerle.dfdl at gmail.com>
Date: 27/06/2013 17:18
Subject: Re: [DFDL-WG] regex free-spacing mode
Hi Mike (et al),
I've gone back to the ICU doc for this and run a few tests locally. It
looks like both forms of the inline flag construct (with and without a
non-capturing group) can now be used in Java 7 and ICU 51.1. In other
words, both of the following constructs are supported:
(?imsx-imsx)
(?imsx-imsx:X)
So the quick answer is that what you are trying to do in your example
below is supported.
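For anyone who wants to repeat the Java side of that check, here is a minimal sketch. It exercises only java.util.regex, not ICU, and the class and field names are made up for illustration.

```java
import java.util.regex.Pattern;

public class InlineFlagForms {
    // Form 1: (?x) turns free-spacing mode on for the rest of the pattern,
    // so the spaces and the trailing # comment are ignored.
    static final Pattern FLAG_ONLY =
        Pattern.compile("(?x) a b c  # spaces and this comment are ignored");

    // Form 2: (?i:X) applies the flag only inside the non-capturing group.
    static final Pattern FLAG_SPAN = Pattern.compile("a(?i:b)c");

    public static void main(String[] args) {
        // Free-spacing pattern behaves like plain "abc".
        System.out.println(FLAG_ONLY.matcher("abc").matches());
        // Only the 'b' is case-insensitive in the second form.
        System.out.println(FLAG_SPAN.matcher("aBc").matches());
        System.out.println(FLAG_SPAN.matcher("ABc").matches());
    }
}
```

Both patterns compile without error, which is the point at issue; the last call returns false because the flag does not leak outside the group.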
The long answer is that erratum 3.29 can probably be updated by removing
the restriction on (?imsx-imsx:X), as below:
3.29 Sections 24 and 30. The DFDL specification is not prescriptive
enough when specifying what is allowed for regular expressions used in the
length property and testPattern property.
Section 24 is replaced by the following.
"A DFDL regular expression may be specified for the dfdl:lengthPattern
format property and the dfdl:testPattern attribute of the dfdl:assert and
dfdl:discriminator annotations. DFDL regular expressions do not interpret
DFDL entities.
A DFDL regular expression is defined by a set of valid pattern characters.
For portability, a DFDL regular expression pattern is restricted to the
inclusive subset of the ICU regular expression [ICURE] and the Java(R) 7
regular expression [JAVARE] with the Unicode flags UNICODE_CASE and
UNICODE_CHARACTER_CLASS turned on. The following regular expression
constructs are not common to both ICU and Java(R) 7 and it is a schema
definition error if any are used in a DFDL regular expression:
Construct                    Meaning                                        Notes
\N{UNICODE CHARACTER NAME}   Match the named character                      ICU only
\X                           Match a Grapheme Cluster                       ICU only
\Uhhhhhhhh                   Match the character with the hex value         ICU only
                             hhhhhhhh.
(?# ... )                    Free-format comment                            ICU only
(?w-w)                       UREGEX_UWORD - Controls the behaviour of \b    ICU only
                             in a pattern.
(?d-d)                       UNIX_LINES - Enables Unix lines mode.          Java 7 only
(?u-u)                       UNICODE_CASE - Enables Unicode-aware case      Java 7 only (1)
                             folding.
(?U-U)                       UNICODE_CHARACTER_CLASS - Enables the Unicode  Java 7 only (2)
                             version of Predefined character classes and
                             POSIX character classes.
(?imsx-imsx:X)               X, as a non-capturing group with the given     Java 7 only
                             flags. Note that the flags i,s,m,x are valid,
                             but appending :X to the flag is not.
Notes:
(1) Implementations using Java 7 must set flag UNICODE_CASE by default to
match ICU.
(2) Implementations using Java 7 must set flag UNICODE_CHARACTER_CLASS by
default to match ICU.
Additionally, the behaviour of the word character construct (\w) is not
consistent in ICU and Java 7. In Java 7 \w is
[\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}], which is a larger
set than ICU where \w is [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]. The use of \w
is not recommended in DFDL regular expressions in conjunction with Unicode
encodings, and an implementation must issue a warning if such usage is
detected.
Character properties are detailed by the Unicode Regular Expressions
[UNICODERE]."
Section 30 is updated to correct the references used in section 24:
- Add: [ICURE] - http://userguide.icu-project.org/strings/regexp
- Add: [UNICODERE] - http://www.unicode.org/reports/tr18/
- Remove: [PERLRE] - http://perldoc.perl.org/perlre.html#Extended-Patterns
- Change: [JAVARE] - http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
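As a rough illustration of notes (1) and (2) and of the \w remark in the proposed section 24 text, the sketch below shows what the two flags change on the Java side. It uses only java.util.regex; compileDfdl and DFDL_FLAGS are hypothetical names, not part of any DFDL implementation, and the ICU behaviour is not exercised here.

```java
import java.util.regex.Pattern;

public class UnicodeFlagDefaults {
    // The two flags that notes (1) and (2) say a Java 7 implementation
    // must set by default.
    static final int DFDL_FLAGS =
        Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS;

    // Hypothetical helper: compile a regex with those defaults applied.
    static Pattern compileDfdl(String regex) {
        return Pattern.compile(regex, DFDL_FLAGS);
    }

    public static void main(String[] args) {
        String lowerE = "\u00e9";      // e with acute accent, lower case
        String upperE = "\u00c9";      // e with acute accent, upper case
        String combining = "\u0301";   // combining acute accent, gc=Mn

        // Without UNICODE_CASE, (?i) only folds US-ASCII letters.
        System.out.println(Pattern.compile("(?i)" + lowerE).matcher(upperE).matches());
        System.out.println(compileDfdl("(?i)" + lowerE).matcher(upperE).matches());

        // With UNICODE_CHARACTER_CLASS, \w also covers combining marks,
        // which is part of why it is a larger set than the ICU definition.
        System.out.println(Pattern.compile("\\w").matcher(combining).matches());
        System.out.println(compileDfdl("\\w").matcher(combining).matches());
    }
}
```

In each pair the first call uses Java's defaults and the second uses the flags from the notes, so the two results differ.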
Cheers,
Andy
Andy Edwards - IBM Integration Bus - DFDL
Email: andy.edwards at uk.ibm.com
Snail Mail: MP211, Hursley park, Hursley, WINCHESTER, Hants, SO21 2JN
Tel int: 247222
Tel ext: +44 (0)1962 817222
Desk: DE3 V17
The Feynman problem solving Algorithm
1) Write down the problem
2) Think real hard
3) Write down the answer
-- Murray Gell-Mann in the NY Times
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl at gmail.com>,
Cc: dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org, Andrew Edwards/UK/IBM at IBMGB
Date: 27/06/2013 10:26
Subject: Re: [DFDL-WG] regex free-spacing mode
Mike, I believe that is the case but I have copied Andy Edwards who is the
person in the IBM DFDL team who added our regex support.
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: dfdl-wg at ogf.org,
Date: 26/06/2013 18:56
Subject: Re: [DFDL-WG] regex free-spacing mode
Sent by: dfdl-wg-bounces at ogf.org
To clarify, errata v13 has this in the table for erratum 3.29 in the list
of non-portables:
(?imsx-imsx:X)   X, as a non-capturing group with the given     Java 7 only
                 flags. Note that the flags i,s,m,x are valid,
                 but appending :X to the flag is not.
I interpret this as meaning that only the so-called modifier-span notation
(the :X suffix) is disallowed, and that plain (?x) is still allowed, but I
wanted to be sure that this is the correct interpretation.
On Wed, Jun 26, 2013 at 1:13 PM, Mike Beckerle <mbeckerle.dfdl at gmail.com>
wrote:
I wrote this complicated regex today and it works in Daffodil.
The question is this: is (?x), which turns on regex free-spacing mode,
officially supported in DFDL?
You can see from below that it is VERY desirable that it works.
<xs:simpleType name="frontMatterType">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:simpleType lengthKind="pattern" terminator="%FF;">
        <dfdl:property name="lengthPattern"><![CDATA[(?x) # regex free spacing mode
          #
          # match the front matter of the document
          #
          .{1,8192}?                 # up to 8K of front matter content
          #
          # front matter ends at the first message description page
          #
          (?=                        # lookahead (followed by but not including...)
            \f                       # a formfeed character
            (?> \s | \x08 ){1,100}?  # whitespace or backspace (x08)
            MESSAGE\ DESCRIPTION\r   # this literal text
            \s{1,100}?               # up to 100 whitespaces
            -{19}\r                  # exactly 19 hyphens and a CR
          )                          # end lookahead
        ]]></dfdl:property>
      </dfdl:simpleType>
    </xs:appinfo>
  </xs:annotation>
  <xs:restriction base="xs:string" />
</xs:simpleType>
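To see how this pattern behaves outside a DFDL processor, here is a small sketch using java.util.regex directly. It assumes the length pattern is applied from the current position (modelled with lookingAt), and the class name, helper methods, and sample data are all invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FrontMatterLength {
    // The lengthPattern from the schema, one Java string per regex line.
    // Each line ends with \n so a # comment stops at the end of its line.
    static final Pattern FRONT_MATTER = Pattern.compile(
          "(?x)                        # regex free-spacing mode\n"
        + ".{1,8192}?                  # up to 8K of front matter content\n"
        + "(?=                         # lookahead\n"
        + "  \\f                       # a formfeed character\n"
        + "  (?> \\s | \\x08 ){1,100}? # whitespace or backspace (x08)\n"
        + "  MESSAGE\\ DESCRIPTION\\r  # this literal text\n"
        + "  \\s{1,100}?               # up to 100 whitespaces\n"
        + "  -{19}\\r                  # exactly 19 hyphens and a CR\n"
        + ")                           # end lookahead\n");

    // Length of the front matter at the start of data, or -1 if no match.
    static int frontMatterLength(String data) {
        Matcher m = FRONT_MATTER.matcher(data);
        return m.lookingAt() ? m.end() : -1;
    }

    // Made-up sample: 12 characters of front matter, then the marker page.
    static String sampleDocument() {
        StringBuilder sb = new StringBuilder("FRONT MATTER");
        sb.append("\f \b ");                 // formfeed, space, backspace, space
        sb.append("MESSAGE DESCRIPTION\r");
        sb.append("   ");
        for (int i = 0; i < 19; i++) {
            sb.append('-');
        }
        sb.append("\rbody");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(frontMatterLength(sampleDocument()));
        System.out.println(frontMatterLength("no marker here"));
    }
}
```

Note that in Java the dot does not match line terminators unless the s flag is also set, so the sample front matter deliberately contains none.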
--
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
--
dfdl-wg mailing list
dfdl-wg at ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg