[DFDL-WG] Clarification needed: regular expressions - does '.' match newlines by default?

Steve Hanson smh at uk.ibm.com
Fri Nov 16 03:38:30 EST 2012


Let's be clear on the two kinds of regex that DFDL requires.

1) Regexes as used in lengthKind 'pattern' and testKind 'pattern' must 
absolutely not be XML Schema regexes. Those are far too restrictive and 
don't allow any of the look-ahead capability that you get with Java or 
Perl. This has caused no end of problems with IBM MRM's TDS pattern 
facility.

2) Regexes as used in the xs:pattern facet for validation. These must be 
regular XSDL regexes so that a DFDL schema is a genuine XML Schema.
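
To illustrate the look-ahead capability mentioned in (1), here is a minimal 
sketch using java.util.regex. It is not taken from any DFDL processor; the 
class name, the sample data, and the ';' delimiter are assumptions chosen 
for illustration. The pattern uses a look-ahead, (?=;), to measure a field 
up to but not including the delimiter, a construct the XML Schema regex 
syntax does not offer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookAheadLength {
    // (?s) lets '.' match line endings; (?=;) is a look-ahead that requires a
    // following ';' without consuming it.
    private static final Pattern FIELD = Pattern.compile("(?s).*?(?=;)");

    /** Returns the length of the field at the start of data, or -1 if no ';' follows. */
    static int fieldLength(String data) {
        Matcher m = FIELD.matcher(data);
        // lookingAt() anchors the match at the start of the input only.
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.println(fieldLength("abc\ndef;rest")); // 7
        System.out.println(fieldLength("no delimiter"));  // -1
    }
}
```

The match length excludes the ';' because a look-ahead asserts what follows 
without consuming it, which is the behavior a length-by-pattern facility 
typically wants.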

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848



From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     Suman Kalia <kalia at ca.ibm.com>, 
Cc:     dfdl-wg at ogf.org
Date:   14/11/2012 18:24
Subject:        Re: [DFDL-WG] Clarification needed: regular expressions - 
does '.' match newlines by default?
Sent by:        dfdl-wg-bounces at ogf.org




I was in a meeting the other day where a number of people said they 
believe the regex capabilities offered in XML Schema are not sufficient.

I am not exactly sure what XML Schema leaves out, but I have many examples 
making use of look-ahead/look-behind features, and I suspect those may be 
an issue.

...mike

On Wed, Nov 14, 2012 at 12:59 PM, Suman Kalia <kalia at ca.ibm.com> wrote:
I came across this issue a couple of weeks ago. The regular expression 
syntax used in XML Schema is stricter than what Java regular expressions 
support. DFDL regular expression syntax and restrictions should match 
the XML Schema specification.

Here is an example for which an APAR has been opened; we will be supplying 
a fix in the WMB toolkit to make regular expressions comply with the XML 
Schema spec.

The following line causes the XML Schema compiler to fail:

<xsd:pattern value="([a-zA-Z0-9 ]|\-|\.|_|\(|\)|\\|\/|.&|\')*"/>

Here the customer has escaped the forward slash and single quote characters. 
Instead of \/ it should be / and instead of \' it should be '.

The following is accepted by the XML Schema compiler:

                                                                        
<xsd:pattern value="([a-zA-Z0-9 ]|\-|\.|_|\(|\)|\\|/|.&|')*"/>       
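
The difference between the two patterns comes down to which backslash 
escapes XML Schema defines. As a rough sketch (this is not the WMB toolkit 
fix, and not a full XML Schema regex parser), the hypothetical helper below 
scans a pattern for backslash escapes outside the set listed in the XML 
Schema 1.0 regex grammar: the single-character escapes, the multi-character 
escapes such as \d and \s, and \p / \P. The class and method names are 
invented for illustration, and the braces after \p and \P are not checked.

```java
import java.util.ArrayList;
import java.util.List;

final class XsdEscapeCheck {
    // Characters that may follow a backslash in an XML Schema 1.0 regex:
    // single-character escapes, multi-character escapes, and \p / \P.
    private static final String ALLOWED = "nrt\\|.?*+(){}-[]^sSiIcCdDwWpP";

    /** Returns each backslash escape in the pattern that XML Schema does not define. */
    static List<String> unsupportedEscapes(String pattern) {
        List<String> bad = new ArrayList<>();
        for (int i = 0; i < pattern.length(); i++) {
            if (pattern.charAt(i) != '\\') {
                continue;
            }
            if (i + 1 >= pattern.length()) {
                bad.add("\\"); // dangling backslash at end of pattern
                break;
            }
            char next = pattern.charAt(i + 1);
            if (ALLOWED.indexOf(next) < 0) {
                bad.add("\\" + next);
            }
            i++; // skip the escaped character so "\\\\" is not read twice
        }
        return bad;
    }

    public static void main(String[] args) {
        String rejected = "([a-zA-Z0-9 ]|\\-|\\.|_|\\(|\\)|\\\\|\\/|.&|\\')*";
        String accepted = "([a-zA-Z0-9 ]|\\-|\\.|_|\\(|\\)|\\\\|/|.&|')*";
        System.out.println(unsupportedEscapes(rejected)); // [\/, \']
        System.out.println(unsupportedEscapes(accepted)); // []
    }
}
```

Java's own regex engine accepts \/ and \' as escaped literals, which is 
probably why such patterns go unnoticed until an XML Schema compiler sees 
them.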
                                                                        




Suman Kalia 
IBM Canada Lab 
WMB Toolkit Architect and Development Lead 
Tel: 905-413-3923 T/L 313-3923 
Email: kalia at ca.ibm.com 

For info on Message broker 
http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.html 






From:        Mike Beckerle <mbeckerle.dfdl at gmail.com> 
To:        Tim Kimber <KIMBERT at uk.ibm.com>, 
Cc:        dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org 
Date:        11/14/2012 12:46 PM 
Subject:        Re: [DFDL-WG] Clarification needed: regular expressions - 
does '.' match newlines by default? 
Sent by:        dfdl-wg-bounces at ogf.org 



I agree with Tim's opinion, but add that this is *NOT* the default 
behavior of the Java regex library we're currently using in Daffodil. I 
believe one must prefix every regex with (?s) to get the non-default 
behavior in which '.' also matches line endings.
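
For reference, a small sketch of that default in java.util.regex. The class 
and method names are made up for illustration; only Pattern, Matcher, the 
(?s) flag, and Pattern.DOTALL come from the Java library. It compares the 
default handling of '.' with the (?s) prefix and with the equivalent 
Pattern.DOTALL compile flag.

```java
import java.util.regex.Pattern;

final class DotNewlineDemo {
    /** Default java.util.regex behavior: '.' excludes line terminators. */
    static boolean matchesByDefault(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    /** Same regex compiled with DOTALL, equivalent to prefixing it with (?s). */
    static boolean matchesWithDotAll(String regex, String input) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "a\nb";
        System.out.println(matchesByDefault("a.b", input));     // false
        System.out.println(matchesByDefault("(?s)a.b", input)); // true
        System.out.println(matchesWithDotAll("a.b", input));    // true
    }
}
```

A processor that wants '.' to match everything by default could compile 
every user-supplied pattern with the DOTALL flag instead of rewriting the 
pattern text, though that is only one possible design.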

On Wed, Nov 14, 2012 at 11:15 AM, Tim Kimber <KIMBERT at uk.ibm.com> wrote: 
I would vote for this feature to be switched off by default in DFDL 
processors. It is mainly useful when dealing with lines of text, but DFDL 
formats are not always lines of text. 
So to be 100% clear, I think the '.' wildcard should match all characters, 
including line endings. 

regards,

Tim Kimber, DFDL Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742  
Internal tel. 37246742




From:        Mike Beckerle <mbeckerle.dfdl at gmail.com> 
To:        dfdl-wg at ogf.org, 
Date:        14/11/2012 12:53 
Subject:        [DFDL-WG] Clarification needed: regular expressions - does 
'.' match newlines by default? 
Sent by:        dfdl-wg-bounces at ogf.org 





A key behavior distinction in regular expressions is whether the '.' 
wildcard matches line endings or not. 

Regular expression libraries can be configured, usually by some sort of 
expression modifier, either way so that the '.' will not match a line 
ending or so that it will.

Question is, how is it configured by default in DFDL regular expressions?

This is part of the overall issue of tightening up regular expressions as 
part of DFDL. I.e., what exactly is the regex dialect, and how is it 
configured by default.

...mike

-- 
Mike Beckerle | OGF DFDL WG Co-Chair 
Tel:  781-330-0412 
--
 dfdl-wg mailing list
 dfdl-wg at ogf.org
 https://www.ogf.org/mailman/listinfo/dfdl-wg 

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU 



