[DFDL-WG] DFDL regular expressions and Unicode - conformance

Cranford, Jonathan W. jcranford at mitre.org
Fri Jul 19 15:12:26 EDT 2013


How does this sound?  I just added a sentence on the end.



> A DFDL regular expression is defined by a set of valid pattern characters.  For
> portability, a DFDL regular expression pattern is restricted to the inclusive subset
> of the ICU regular expression [ICURE] and the Java(R) 7 regular expression
> [JAVARE] with the Unicode flags UNICODE_CASE and
> UNICODE_CHARACTER_CLASS turned on.  DFDL regular expressions thereby conform to
> Unicode Technical Standard #18, Unicode Regular Expressions, level 1 [UNICODERE].
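
To make the effect of the two flags concrete, here is a minimal Java sketch. It assumes a Java 7 or later runtime; the class name DfdlRegexFlagsDemo, the constant DFDL_FLAGS, and the helper matches() are hypothetical names chosen for illustration and do not come from the errata or from any DFDL implementation. It shows that \w and case-insensitive matching only cover non-ASCII characters once the flags are on, and that canonical equivalence (which UTS #18 lists under Level 2 rather than Level 1, as far as I can tell) is still not applied.

```java
import java.util.regex.Pattern;

public class DfdlRegexFlagsDemo {
    // The two flags named in the proposed text (assumed to be applied to every pattern).
    static final int DFDL_FLAGS = Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS;

    // Hypothetical helper: true if the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";       // precomposed LATIN SMALL LETTER E WITH ACUTE
        String eAcuteUpper = "\u00C9";  // precomposed LATIN CAPITAL LETTER E WITH ACUTE
        String decomposed = "e\u0301";  // 'e' followed by COMBINING ACUTE ACCENT

        // \w is ASCII-only by default; UNICODE_CHARACTER_CLASS widens it.
        System.out.println(matches("\\w", 0, eAcute));           // false
        System.out.println(matches("\\w", DFDL_FLAGS, eAcute));  // true

        // (?i) folds only US-ASCII by default; UNICODE_CASE makes it Unicode-aware.
        System.out.println(matches("(?i)" + eAcute, 0, eAcuteUpper));           // false
        System.out.println(matches("(?i)" + eAcute, DFDL_FLAGS, eAcuteUpper));  // true

        // Canonical equivalence is not applied by these flags.
        System.out.println(matches(eAcute, DFDL_FLAGS, decomposed));  // false
    }
}
```

A schema author who needs the precomposed and decomposed forms to match would therefore have to normalize the data or write both forms into the pattern.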





>-----Original Message-----

>From: Steve Hanson [mailto:smh at uk.ibm.com]

>Sent: Tuesday, July 16, 2013 9:13 AM

>To: Andrew Edwards

>Cc: dfdl-wg at ogf.org; dfdl-wg-bounces at ogf.org; Cranford, Jonathan W.

>Subject: Re: [DFDL-WG] DFDL regular expressions and Unicode - conformance

>

>Jonathan

>

>No need for us to contact ICU; as Andy indicates below, ICU and Java both claim
>conformance.
>
>Here are the words from errata 3.29.  Please can you rephrase to combine the
>conformance requirement and the restrictions, so that we end up with a form you
>are happy with, then we can update the errata?

>

>A DFDL regular expression is defined by a set of valid pattern characters.  For

>portability, a DFDL regular expression pattern is restricted to the inclusive subset

>of the ICU regular expression [ICURE] and the Java(R) 7 regular expression

>[JAVARE] with the Unicode flags UNICODE_CASE and

>UNICODE_CHARACTER_CLASS turned on.



>DFDL regular expressions thereby conform to
>Unicode Technical Standard #18, Unicode Regular Expressions, level 1,



>

>Regards

>

>Steve Hanson

>Architect, IBM Data Format Description Language (DFDL)

>Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>

>IBM SWG, Hursley, UK

>smh at uk.ibm.com

>tel:+44-1962-815848

>

>

>

>From:        Andrew Edwards/UK/IBM

>To:        Steve Hanson/UK/IBM at IBMGB,

>Cc:        dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org,
>"Cranford, Jonathan W." <jcranford at mitre.org>

>Date:        11/07/2013 14:19

>Subject:        Re: [DFDL-WG] DFDL regular expressions and Unicode

>

>________________________________

>

>

>

>Hi Jonathan,

>

>Sorry for the delay; first week back in the office...

>

>As you've noted, errata 3.29 describes which DFDL regexes are supported.
>Specifically, it is a subset of Java 7's java.util.regex
>(http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) and
>ICU's regular expression support (http://userguide.icu-project.org/strings/regexp),
>both of which conform with level 1 of Unicode Technical Standard #18.

>

>It looks like there are 2 stages to checking conformance:

>

>* Logical - do the available regex constructs provide conformance to the
>technical standard?  This is probably just a couple of hours of reading the Unicode
>standard rules and cross-checking the constructs in each matching engine.
>* Actual - do Java 7 and ICU really match properly for each of the
>conformance statements?  This can take an ever-increasing amount of time
>testing various sets of data and regex patterns, and it risks the only reward being
>that we find bugs in Java 7 or ICU.  The minimum would be 3 or 4 days of test
>generation.

>

>

>Does that answer the issue?

>Andy

>Andy Edwards - IBM Integration Bus <http://www-03.ibm.com/software/products/us/en/integration-bus>
>- DFDL <https://w3-connections.ibm.com/wikis/home?lang=en-gb#!/wiki/IBM%20Data%20Format%20Description%20Language>

>

>

>Email: andy.edwards at uk.ibm.com

>Snail Mail:         MP211, Hursley park, Hursley, WINCHESTER, Hants, SO21 2JN

>Tel int:               247222

>Tel ext:              +44 (0)1962 817222

>Desk:   DE3 V17

>

>The Feynman problem solving Algorithm

> 1) Write down the problem

> 2) Think real hard

> 3) Write down the answer

>-- Murray Gell-Mann in the NY Times

>

>

>

>

>

>Unless stated otherwise above:

>IBM United Kingdom Limited - Registered in England and Wales with number

>741598.

>Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

>

>

>From:        Steve Hanson/UK/IBM
>To:        "Cranford, Jonathan W." <jcranford at mitre.org>
>Cc:        dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org, Andrew Edwards/UK/IBM at IBMGB
>Date:        08/07/2013 11:08
>Subject:        Re: [DFDL-WG] DFDL regular expressions and Unicode

>

>

>

>

>

>Jonathan

>

>I've copied Andy, who added regex support into IBM DFDL recently. He might
>have an idea as to the effort involved in stating conformance.
>
>We will discuss your other two emails on the next DFDL-WG call or so.

>

>Regards

>

>Steve Hanson

>Architect, IBM Data Format Description Language (DFDL)

>Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>

>IBM SWG, Hursley, UK

>smh at uk.ibm.com

>tel:+44-1962-815848

>

>

>

>From:        "Cranford, Jonathan W." <jcranford at mitre.org>
>To:        dfdl-wg at ogf.org
>Date:        06/07/2013 00:56
>Subject:        Re: [DFDL-WG] DFDL regular expressions and Unicode
>Sent by:        dfdl-wg-bounces at ogf.org

>

>________________________________

>

>

>

>

>Update: I just found errata 3.29, which answers this question, I think.

>

>From the description in the errata, and looking at the documentation for Java 7
>regular expressions, it looks like DFDL regular expressions conform to level 1 of
>Unicode Regular Expressions (UTS#18).
>
>I still think there would be value in stating such conformance in the DFDL spec,
>but I suppose that would take some legwork for someone to actually confirm the
>conformance of ICU and Java 7 to level 1.

>

>Very respectfully,

>

>-- Jonathan Cranford

>

>

>>-----Original Message-----

>>From: Cranford, Jonathan W.

>>Sent: Friday, July 05, 2013 1:36 PM

>>To: dfdl-wg at ogf.org

>>Subject: DFDL regular expressions and Unicode

>>

>>I've been going through the spec recently, and I have a few questions about DFDL
>>regular expressions.
>>
>>Rather than put them into one long email, I'll break them up into separate emails.
>>
>>First question:  What level of conformance to Unicode Technical Standard #18
>>    UNICODE REGULAR EXPRESSIONS do DFDL regular expressions claim?
>>
>>    For example,
>>    * XML Schema regular expressions are "targeted at support of 'Level 1' features"
>>        (http://www.w3.org/TR/xmlschema-2/#dt-ccesN)
>>    * Java 1.4 regular expressions "implement its second level of support"
>>        (http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html)
>>    * Perl 5.18 seems to implement most of Level 1
>>        (http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level)
>>
>>    I think the conformance level should be specified in the DFDL spec so that it is
>>    clear to schema designers what a regular expression would really match against.
>>    Details like case conversion and canonical equivalence make a difference when
>>    matching against a Unicode string.

>>

>>Thanks in advance,

>>

>>--

>>Jonathan W. Cranford <jcranford at mitre.org>
>>Senior Information Systems Engineer
>>The MITRE Corporation (http://www.mitre.org)

>

>--

> dfdl-wg mailing list

> dfdl-wg at ogf.org
> https://www.ogf.org/mailman/listinfo/dfdl-wg

>

>

>

