[DFDL-WG] for errata r14 - Fwd: DFDL regular expressions and Unicode - conformance

Steve Hanson smh at uk.ibm.com
Wed Jul 24 07:34:10 EDT 2013


Correct.

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848



From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     Steve Hanson/UK/IBM at IBMGB, 
Cc:     dfdl-wg at ogf.org
Date:   24/07/2013 01:33
Subject:        for errata r14 - Fwd: [DFDL-WG] DFDL regular expressions 
and Unicode - conformance




I am assuming this issue will get handled as part of an r14 erratum.

---------- Forwarded message ----------
From: Steve Hanson <smh at uk.ibm.com>
Date: Tue, Jul 16, 2013 at 11:13 AM
Subject: Re: [DFDL-WG] DFDL regular expressions and Unicode - conformance
To: Andrew Edwards <andy.edwards at uk.ibm.com>
Cc: "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, dfdl-wg-bounces at ogf.org


Jonathan 

No need for us to contact ICU; as Andy indicates below, ICU and Java both 
claim conformance. 

Here is the wording from errata 3.29.  Please can you rephrase it to combine 
the conformance requirement and the restrictions, so that we end up with a 
form you are happy with? Then we can update the errata. 

A DFDL regular expression is defined by a set of valid pattern characters. 
For portability, a DFDL regular expression pattern is restricted to the 
inclusive subset of the ICU regular expression [ICURE] and the Java(R) 7 
regular expression [JAVARE] with the Unicode flags UNICODE_CASE and 
UNICODE_CHARACTER_CLASS turned on. 
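
For readers unfamiliar with the two Java flags named in that wording, here 
is a minimal sketch of what they change on the Java side. The class and 
method names are hypothetical and do not come from the DFDL spec or any 
DFDL implementation; only java.util.regex.Pattern and its flag constants 
are real API. The sample patterns are illustrative and have not been 
checked against the ICU side of the subset.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of any DFDL implementation.
class DfdlRegexFlags {
    // The two flags named in the errata 3.29 wording (both added in Java 7).
    static final int UNICODE_FLAGS =
            Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS;

    // True when the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String cafe = "caf\u00e9"; // ends in U+00E9, e with acute accent

        // Without the flags, \w covers only [a-zA-Z_0-9].
        System.out.println(matches("\\w+", 0, cafe));
        // With UNICODE_CHARACTER_CLASS, \w follows the Unicode definition.
        System.out.println(matches("\\w+", UNICODE_FLAGS, cafe));

        // UNICODE_CASE only matters for case-insensitive matching:
        // pattern U+00C9 (capital) against input U+00E9 (small).
        System.out.println(matches("\u00c9", Pattern.CASE_INSENSITIVE, "\u00e9"));
        System.out.println(matches("\u00c9",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, "\u00e9"));
    }
}
```

On a Java 7 or later runtime this should print false, true, false, true: 
each flag widens a construct that is otherwise limited to US-ASCII.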

Regards

Steve Hanson



From:        Andrew Edwards/UK/IBM 
To:        Steve Hanson/UK/IBM at IBMGB, 
Cc:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, dfdl-wg-bounces at ogf.org, 
"Cranford, Jonathan W." <jcranford at mitre.org> 
Date:        11/07/2013 14:19 
Subject:        Re: [DFDL-WG] DFDL regular expressions and Unicode 


Hi Jonathan, 

Sorry for the delay; first week back in the office... 

As you've noted, errata 3.29 describes which DFDL regexes are supported. 
Specifically, it is a subset of both Java 7's java.util.regex 
(http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) and 
ICU's regular expression support 
(http://userguide.icu-project.org/strings/regexp), both of which conform 
to Level 1 of Unicode Technical Standard #18. 

It looks like there are 2 stages to checking conformance: 
 1) Logical - do the available regex constructs provide conformance to the 
    technical standard?  This is probably just a couple of hours of reading 
    the Unicode standard rules and cross-checking the constructs in each 
    matching engine. 
 2) Actual - do Java 7 and ICU really match properly for each of the 
    conformance statements?  This can take an ever-increasing amount of time 
    testing various sets of data and regex patterns, and it risks the only 
    reward being that we find bugs in Java 7 or ICU.  Minimum would be 3 or 4 
    days of test generation.
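
As a rough illustration of what the "Actual" stage might start from, here 
is a table-driven spot check for the Java side only. It is a sketch under 
assumptions: the class name, the helper, and the choice of sample patterns 
are hypothetical, the RL1.x labels are the Level 1 requirement names from 
UTS #18, and one sample per requirement is nowhere near real coverage. An 
equivalent harness would be needed for ICU, and whether each sample falls 
inside the DFDL subset is not checked here.

```java
import java.util.regex.Pattern;

// Hypothetical spot-check harness; one sample per UTS #18 Level 1 requirement.
class Uts18Level1SpotCheck {
    // Flags from the errata 3.29 wording.
    static final int FLAGS =
            Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS;

    // True when the whole input matches the regex under FLAGS.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex, FLAGS).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Columns: requirement, regex, input, expected result.
        String[][] cases = {
            {"RL1.1 hex notation", "\\u00E9", "\u00e9", "true"},
            {"RL1.2 properties", "\\p{Lu}", "\u00c9", "true"},
            {"RL1.3 subtraction and intersection",
                    "[\\p{L}&&[^\\p{Lu}]]", "\u00c9", "false"},
            {"RL1.5 simple loose matches", "(?i)\u00c9", "\u00e9", "true"},
            {"RL1.7 supplementary code points", ".", "\uD83D\uDE00", "true"},
        };
        for (String[] c : cases) {
            boolean actual = fullMatch(c[1], c[2]);
            boolean ok = actual == Boolean.parseBoolean(c[3]);
            System.out.println((ok ? "PASS " : "FAIL ") + c[0]);
        }
    }
}
```

A table like this grows quickly once each requirement gets realistic data, 
which is consistent with the multi-day estimate above.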

Does that answer the issue? 
Andy 
Andy Edwards - IBM Integration Bus - DFDL 


Email:      andy.edwards at uk.ibm.com 
Snail Mail: MP211, Hursley park, Hursley, WINCHESTER, Hants, SO21 2JN 
Tel int:    247222 
Tel ext:    +44 (0)1962 817222 
Desk:       DE3 V17

The Feynman problem solving Algorithm
 1) Write down the problem
 2) Think real hard
 3) Write down the answer
-- Murray Gell-Mann in the NY Times





Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


From:        Steve Hanson/UK/IBM 
To:        "Cranford, Jonathan W." <jcranford at mitre.org>, 
Cc:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, dfdl-wg-bounces at ogf.org, 
Andrew Edwards/UK/IBM at IBMGB 
Date:        08/07/2013 11:08 
Subject:        Re: [DFDL-WG] DFDL regular expressions and Unicode







Jonathan 

I've copied Andy, who recently added regex support to IBM DFDL. He might 
have an idea of the effort involved in stating conformance. 

We will discuss your other two emails on the next DFDL-WG call or so. 

Regards

Steve Hanson



From:        "Cranford, Jonathan W." <jcranford at mitre.org> 
To:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, 
Date:        06/07/2013 00:56 
Subject:        Re: [DFDL-WG] DFDL regular expressions and Unicode 
Sent by:        dfdl-wg-bounces at ogf.org 



Update: I just found errata 3.29, which answers this question, I think.

From the description in the errata, and looking at the documentation for 
Java 7 regular expressions, it looks like DFDL regular expressions conform 
to Level 1 of Unicode Regular Expressions (UTS #18).

I still think there would be value in stating such conformance in the DFDL 
spec, but I suppose that would take some legwork for someone to actually 
confirm the conformance of ICU and Java 7 to Level 1.

Very respectfully,

-- Jonathan Cranford


>-----Original Message-----
>From: Cranford, Jonathan W.
>Sent: Friday, July 05, 2013 1:36 PM
>To: dfdl-wg at ogf.org
>Subject: DFDL regular expressions and Unicode
>
>I've been going through the spec recently, and I have a few questions about DFDL
>regular expressions.
>
>Rather than put them into one long email, I'll break them up into separate emails.
>
>First question:  What level of conformance to Unicode Technical Standard #18
>UNICODE REGULAR EXPRESSIONS do DFDL regular expressions claim?
>
>    For example,
>    * XML Schema regular expressions are "targeted at support of 'Level 1' features"
>        (http://www.w3.org/TR/xmlschema-2/#dt-ccesN)
>    * Java 1.4 regular expressions "implement its second level of support"
>        (http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html)
>    * Perl 5.18 seems to implement most of Level 1
>        (http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level)
>
>    I think the conformance level should be specified in the DFDL spec so that it is
>    clear to schema designers what a regular expression would really match against.
>    Details like case conversion and canonical equivalence make a difference when
>    matching against a Unicode string.
>
>Thanks in advance,
>
>--
>Jonathan W. Cranford <jcranford at mitre.org>
>Senior Information Systems Engineer
>The MITRE Corporation (http://www.mitre.org)
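
On the point in the original question that canonical equivalence makes a 
difference: UTS #18 lists canonical equivalents under Level 2 (RL2.1), so a 
Level 1 engine is not required to treat a precomposed character and its 
decomposed form as the same. The following is a minimal Java sketch of the 
effect, with hypothetical class and method names. Pattern.CANON_EQ is a real 
Java flag but is not one of the flags named in the errata 3.29 wording, and 
normalizing data to NFC before matching is only one possible workaround, not 
something the DFDL spec prescribes.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo class illustrating canonical equivalence in matching.
class CanonicalEquivalenceDemo {
    static final String PRECOMPOSED = "\u00e9";  // single code point U+00E9
    static final String DECOMPOSED = "e\u0301";  // U+0065 then combining U+0301

    // Normalize a string to NFC (canonical composition).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // True when the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Plain matching compares code points, so the two forms differ.
        System.out.println(matches(DECOMPOSED, 0, PRECOMPOSED));
        // CANON_EQ asks Java to take canonical equivalence into account.
        System.out.println(matches(DECOMPOSED, Pattern.CANON_EQ, PRECOMPOSED));
        // Alternative: normalize pattern and data to the same form first.
        System.out.println(matches(nfc(DECOMPOSED), 0, nfc(PRECOMPOSED)));
    }
}
```

The first line printed should be false and the last true; the middle case 
depends on the engine's CANON_EQ support, which is the kind of detail a 
stated conformance level would make explicit for schema designers.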

--
 dfdl-wg mailing list
 dfdl-wg at ogf.org
 https://www.ogf.org/mailman/listinfo/dfdl-wg







-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com

