[DFDL-WG] DFDL regular expressions: using Unicode features to match against non-Unicode encoding
Cranford, Jonathan W.
jcranford at mitre.org
Fri Jul 5 18:06:17 EDT 2013
Next question: What should happen if a regular expression uses a Unicode
block or category to match against a string in a non-Unicode encoding?
Should that be a Schema Definition Error, or should the regular expression
just silently fail to match anything? I would prefer a Schema Definition
Error, even though detecting such a condition would be difficult.
For example:
<?xml version="1.0" encoding="UTF-8"?>
...
<xs:element name="foo" type="xs:string" dfdl:encoding="Shift_JIS">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:assert testKind="pattern" testPattern="\p{InGreek}"/>
    </xs:appinfo>
  </xs:annotation>
</xs:element>
...
\p{InGreek} matches a character in the Greek Unicode block, but
such a block is incongruent with the Shift_JIS encoding. In short,
a Unicode block or category does not seem to make sense against a
non-Unicode encoding. For example, \p{Lu} matches a character in the
uppercase letter category, but the list of Unicode characters in that
category cannot be easily compared to the characters of the Shift_JIS
encoding.
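To illustrate what a schema-compile-time check might involve, here is a
minimal Python sketch. It is only an illustration, not part of DFDL or of
any implementation: the function names, the set of encodings treated as
Unicode, and the diagnostic wording are all assumptions, and it only
recognises the braced \p{...} and \P{...} forms.

```python
import re

# Encodings treated as Unicode in this sketch (an assumption for
# illustration, not a list taken from the DFDL specification).
UNICODE_ENCODINGS = {
    "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE",
    "UTF-32", "UTF-32BE", "UTF-32LE",
}

# Matches either a Unicode property escape such as \p{InGreek} or \P{Lu},
# or any other escaped character. Consuming other escapes ensures that
# "\\p{Lu}" (a literal backslash followed by "p{Lu}") is not misread.
_TOKEN = re.compile(r"\\(?:([pP])\{([^}]*)\}|.)", re.DOTALL)


def unicode_property_escapes(pattern):
    """Return the names used in \\p{...} / \\P{...} escapes (hypothetical helper)."""
    return [m.group(2) for m in _TOKEN.finditer(pattern) if m.group(1)]


def check_pattern(pattern, encoding):
    """Return diagnostic strings; an empty list means nothing was flagged."""
    if encoding.upper() in UNICODE_ENCODINGS:
        return []
    return [
        "\\p{%s} used with non-Unicode encoding %s" % (name, encoding)
        for name in unicode_property_escapes(pattern)
    ]
```

With the example above, check_pattern(r"\p{InGreek}", "Shift_JIS") returns
one diagnostic, while the same pattern with UTF-8 returns none. A check like
this only helps when the encoding is known when the schema is compiled; if
the encoding is only known at run time, it could not be applied this way.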
Thanks in advance,
--
Jonathan W. Cranford
Senior Information Systems Engineer
The MITRE Corporation (http://www.mitre.org)