[DFDL-WG] DFDL regular expressions and Unicode - conformance

Steve Hanson smh at uk.ibm.com
Mon Jul 22 04:43:56 EDT 2013


That looks good to me. Let's close on Tues WG call.

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848



From:   "Cranford, Jonathan W." <jcranford at mitre.org>
To:     Steve Hanson/UK/IBM at IBMGB, Andrew Edwards/UK/IBM at IBMGB, 
Cc:     "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>
Date:   19/07/2013 20:12
Subject:        RE: [DFDL-WG] DFDL regular expressions and Unicode - conformance



How does this sound?  I just added a sentence on the end.
 
> A DFDL regular expression is defined by a set of valid pattern characters.  For
> portability, a DFDL regular expression pattern is restricted to the inclusive subset
> of the ICU regular expression [ICURE] and the Java(R) 7 regular expression
> [JAVARE] with the Unicode flags UNICODE_CASE and
> UNICODE_CHARACTER_CLASS turned on.  DFDL regular expressions thereby conform to
> Unicode Technical Standard #18, Unicode Regular Expressions, level 1
> [UNICODERE].
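
On the Java side, the effect of the two flags can be sketched roughly as follows. This is only an illustration of java.util.regex behaviour in Java 7 or later, not of ICU and not of any DFDL implementation; the class name DfdlRegexFlags and the helper compileDfdlStyle are hypothetical names made up for this example.

```java
import java.util.regex.Pattern;

public class DfdlRegexFlags {

    // Hypothetical helper (not from the DFDL spec): compile a pattern with the
    // two Unicode flags named in the proposed wording turned on.
    static Pattern compileDfdlStyle(String regex) {
        return Pattern.compile(regex,
                Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS);
    }

    public static void main(String[] args) {
        // ARABIC-INDIC DIGIT THREE (U+0663), general category Nd.
        String arabicIndicThree = "\u0663";

        // Without the flags, the digit class matches only ASCII 0-9: prints false.
        System.out.println(Pattern.compile("\\d").matcher(arabicIndicThree).matches());
        // With UNICODE_CHARACTER_CLASS it matches any Unicode decimal digit: prints true.
        System.out.println(compileDfdlStyle("\\d").matcher(arabicIndicThree).matches());

        // UNICODE_CASE only matters once case-insensitive matching is requested,
        // here through the inline (?i) flag. The pattern is a lowercase e-acute,
        // the input an uppercase E-acute.
        // Default case folding is US-ASCII only: prints false.
        System.out.println(Pattern.compile("(?i)\u00e9").matcher("\u00c9").matches());
        // With UNICODE_CASE the fold follows the Unicode Standard: prints true.
        System.out.println(compileDfdlStyle("(?i)\u00e9").matcher("\u00c9").matches());
    }
}
```

Note that UNICODE_CASE does not by itself make a pattern case-insensitive; it only changes how case-insensitive matching behaves once it has been enabled.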
 
 
>-----Original Message-----
>From: Steve Hanson [mailto:smh at uk.ibm.com]
>Sent: Tuesday, July 16, 2013 9:13 AM
>To: Andrew Edwards
>Cc: dfdl-wg at ogf.org; dfdl-wg-bounces at ogf.org; Cranford, Jonathan W.
>Subject: Re: [DFDL-WG] DFDL regular expressions and Unicode - conformance
> 
>Jonathan
> 
>No need for us to contact ICU; as Andy indicates below, ICU and Java both claim
>conformance.
> 
>Here are the words from errata 3.29.  Please can you rephrase to combine the
>conformance requirement and the restrictions, so that we end up with a form you
>are happy with? Then we can update the errata.
> 
>A DFDL regular expression is defined by a set of valid pattern characters.  For
>portability, a DFDL regular expression pattern is restricted to the inclusive subset
>of the ICU regular expression [ICURE] and the Java(R) 7 regular expression
>[JAVARE] with the Unicode flags UNICODE_CASE and
>UNICODE_CHARACTER_CLASS turned on.
 
DFDL regular expressions thereby conform to
Unicode Technical Standard #18, Unicode Regular Expressions, level 1,
 
> 
>Regards
> 
>Steve Hanson
> 
> 
> 
>From:        Andrew Edwards/UK/IBM
>To:        Steve Hanson/UK/IBM at IBMGB,
>Cc:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, dfdl-wg-bounces at ogf.org,
>"Cranford, Jonathan W." <jcranford at mitre.org>
>Date:        11/07/2013 14:19
>Subject:        Re: [DFDL-WG] DFDL regular expressions and Unicode
> 
>________________________________
> 
> 
> 
>Hi Jonathan,
> 
>Sorry for the delay; first week back in the office...
> 
>As you've noted, errata 3.29 describes what DFDL regexes are supported.
>Specifically, it is a subset of Java 7's java.util.regex
>(http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) and
>ICU's regular expression support
>(http://userguide.icu-project.org/strings/regexp), both of which conform with
>level 1 of Unicode Technical Standard #18.
> 
>It looks like there are 2 stages to checking conformance:
> 
>*   Logical - do the available regex constructs provide conformance to the
>    technical standard?  This is probably just a couple of hours of reading the
>    Unicode standard rules and cross-checking the constructs in each matching engine.
>*   Actual - do Java 7 and ICU really match properly for each of the
>    conformance statements?  This can take an ever-increasing amount of time
>    testing various sets of data and regex patterns, and it risks the only reward being
>    that we find bugs in Java 7 or ICU.  Minimum would be 3 or 4 days of test
>    generation.
> 
> 
>Does that answer the issue?
>Andy
>Andy Edwards - IBM Integration Bus <http://www-03.ibm.com/software/products/us/en/integration-bus> - DFDL <https://w3-connections.ibm.com/wikis/home?lang=en-gb#!/wiki/IBM%20Data%20Format%20Description%20Language>
> 
> 
>Email: andy.edwards at uk.ibm.com
>Snail Mail:         MP211, Hursley park, Hursley, WINCHESTER, Hants, SO21 2JN
>Tel int:               247222
>Tel ext:              +44 (0)1962 817222
>Desk:   DE3 V17
> 
>The Feynman problem solving Algorithm
> 1) Write down the problem
> 2) Think real hard
> 3) Write down the answer
>-- Murray Gell-Mann in the NY Times
> 
> 
> 
> 
> 
>Unless stated otherwise above:
>IBM United Kingdom Limited - Registered in England and Wales with number
>741598.
>Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
> 
> 
>From:        Steve Hanson/UK/IBM
>To:        "Cranford, Jonathan W." <jcranford at mitre.org>,
>Cc:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, dfdl-wg-bounces at ogf.org,
>Andrew Edwards/UK/IBM at IBMGB
>Date:        08/07/2013 11:08
>Subject:        Re: [DFDL-WG] DFDL regular expressions and Unicode
> 
> 
> 
> 
> 
>Jonathan
> 
>I've copied Andy, who added regex support to IBM DFDL recently. He might
>have an idea as to the effort involved in stating conformance.
> 
>We will discuss your other two emails on the next DFDL-WG call or so.
> 
>Regards
> 
>Steve Hanson
> 
> 
> 
>From:        "Cranford, Jonathan W." <jcranford at mitre.org>
>To:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>,
>Date:        06/07/2013 00:56
>Subject:        Re: [DFDL-WG] DFDL regular expressions and Unicode
>Sent by:        dfdl-wg-bounces at ogf.org
> 
>________________________________
> 
> 
> 
> 
>Update: I just found errata 3.29, which answers this question, I think.
> 
>From the description in the errata, and looking at the documentation for Java 7
>regular expressions, it looks like DFDL regular expressions conform to level 1 of
>Unicode Regular Expressions (UTS #18).
> 
>I still think there would be value in stating such conformance in the DFDL spec,
>but I suppose that would take some legwork for someone to actually confirm the
>conformance of ICU and Java 7 to level 1.
> 
>Very respectfully,
> 
>-- Jonathan Cranford
> 
> 
>>-----Original Message-----
>>From: Cranford, Jonathan W.
>>Sent: Friday, July 05, 2013 1:36 PM
>>To: dfdl-wg at ogf.org
>>Subject: DFDL regular expressions and Unicode
>> 
>>I've been going through the spec recently, and I have a few questions about DFDL
>>regular expressions.
>> 
>>Rather than put them into one long email, I'll break them up into separate emails.
>> 
>>First question:  What level of conformance to Unicode Technical Standard #18
>>UNICODE REGULAR EXPRESSIONS do DFDL regular expressions claim?
>> 
>>    For example,
>>    * XML Schema regular expressions are "targeted at support of 'Level 1' features"
>>        (http://www.w3.org/TR/xmlschema-2/#dt-ccesN)
>>    * Java 1.4 regular expressions "implement its second level of support"
>>        (http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html)
>>    * Perl 5.18 seems to implement most of Level 1
>>        (http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level)
>> 
>>    I think the conformance level should be specified in the DFDL spec so that it is
>>    clear to schema designers what a regular expression would really match against.
>>    Details like case conversion and canonical equivalence make a difference when
>>    matching against a Unicode string.
>> 
>>Thanks in advance,
>> 
>>--
>>Jonathan W. Cranford <jcranford at mitre.org>
>>Senior Information Systems Engineer
>>The MITRE Corporation (http://www.mitre.org)
> 
 
