[DFDL-WG] DFDL regular expressions: using Unicode features to match against non-Unicode encoding

Cranford, Jonathan W. jcranford at mitre.org
Fri Jul 5 18:06:17 EDT 2013


Next question: What should happen if a regular expression uses a Unicode 
block or category to match against a string in a non-Unicode encoding?
Should that be a Schema Definition Error, or should the regular expression
just silently fail to match anything?  I would prefer a Schema Definition
Error, even though detecting such a condition would be difficult.

For example:

<?xml version="1.0" encoding="UTF-8"?>
...
<xs:element name="foo" type="xs:string" dfdl:encoding="Shift_JIS">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:assert testKind="pattern" testPattern="\p{InGreek}"/>
    </xs:appinfo>
  </xs:annotation>
</xs:element>
...

\p{InGreek} matches a character in the Greek Unicode block, but
such a block is incongruent with the Shift_JIS encoding.  In short,
a Unicode block or category does not make sense against a non-Unicode
encoding.  For example, \p{Lu} matches a character in the uppercase letter
category, but the list of Unicode characters in that category cannot
easily be compared with the characters of the Shift_JIS encoding.
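
To make the Schema Definition Error option a bit more concrete, here is a
rough sketch of how a processor might detect the condition.  It is only an
illustration, not anything the DFDL spec prescribes: the names
UNICODE_ENCODINGS, unicode_properties, check_pattern and the
SchemaDefinitionError class are all hypothetical, the list of "Unicode"
encodings is my assumption, and the scan only handles the braced \p{...} and
\P{...} forms (it ignores the single-letter \pL form and \Q...\E quoting).

```python
import re

# Assumption: the encodings a processor would treat as "Unicode" for this check.
UNICODE_ENCODINGS = {
    "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE",
    "UTF-32", "UTF-32BE", "UTF-32LE",
}

# Consume escapes one at a time so that an escaped backslash followed by
# "p{...}" (a literal backslash, then plain text) is not mistaken for a
# property escape. Group 1 is set only for \p{Name} or \P{Name}.
_ESCAPE = re.compile(r"\\(?:[pP]\{([^}]*)\}|.)", re.DOTALL)


class SchemaDefinitionError(Exception):
    """Hypothetical stand-in for a DFDL Schema Definition Error."""


def unicode_properties(pattern):
    """Return the names used in braced \\p{...} / \\P{...} escapes."""
    return [m.group(1) for m in _ESCAPE.finditer(pattern)
            if m.group(1) is not None]


def check_pattern(pattern, encoding):
    """Raise if the pattern uses Unicode properties with a non-Unicode encoding."""
    props = unicode_properties(pattern)
    if props and encoding.upper() not in UNICODE_ENCODINGS:
        raise SchemaDefinitionError(
            "pattern uses Unicode properties %s but dfdl:encoding is %s"
            % (props, encoding)
        )
```

With this sketch, check_pattern(r"\p{InGreek}", "Shift_JIS") would raise,
while the same pattern with "UTF-8" would pass.  It also shows part of why
detection is awkward: a real check would have to follow the full regular
expression syntax, and dfdl:encoding may not be known until runtime.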

Thanks in advance,
    
 --
Jonathan W. Cranford 
Senior Information Systems Engineer
The MITRE Corporation (http://www.mitre.org)



