[DFDL-WG] matching DFDL regular expressions against different encoding

Cranford, Jonathan W. jcranford at mitre.org
Fri Jul 5 18:03:43 EDT 2013


Next question:  What does it mean to match a regular expression written in one encoding system against a different encoding system?



For example:



    <?xml version="1.0" encoding="UTF-8"?>
    ...
    <xs:element name="foo" type="xs:string" dfdl:encoding="Shift_JIS">
      <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:assert testKind="pattern" testPattern="\\"/>
        </xs:appinfo>
      </xs:annotation>
    </xs:element>
    ...



The only interpretation that makes sense to me is that the regular expression is *always* expressed in the encoding of the XML schema (UTF-8 in this case), since that’s what any XML parser will do; that is, any XML parser will interpret all element and attribute content according to the encoding of the XML file.  Otherwise, to have a regular expression expressed in the target encoding represented by dfdl:encoding (Shift_JIS in this example), you could not use an XML parser to parse the DFDL schema; you would have to use a custom parser that applies a *different* encoding to select attributes or elements in the DFDL schema, and that just seems contrary to the spirit and intent of DFDL.
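
To illustrate the point about XML parsers, here is a minimal Python sketch using the standard library's xml.etree.ElementTree.  The un-prefixed assert element is a hypothetical, cut-down stand-in for dfdl:assert, not a real DFDL schema.  It suggests that by the time an application sees testPattern, the value is a sequence of Unicode characters rather than bytes in any particular encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for a DFDL schema file stored as UTF-8.
# The attribute value on disk is two backslash characters: \\
schema_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<assert testKind="pattern" testPattern="\\\\"/>'
).encode("utf-8")

# The parser decodes the file using the encoding named in the XML declaration.
root = ET.fromstring(schema_bytes)
test_pattern = root.get("testPattern")

# What the application receives is a str of Unicode code points.
code_points = [hex(ord(ch)) for ch in test_pattern]
print(type(test_pattern).__name__, code_points)
```

Under this reading, the dfdl:encoding of the element never influences how the pattern text itself is read.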



However, this could lead to some very interesting problems.  A single string, testPattern, has to be compiled as a regular expression *and* converted to match another encoding.



The biggest question, I think, is how character literals and character classes in the regular expression should be treated.



Here, if you map the raw bytes, \ in UTF-8 is 0x5C, which maps to the yen character (¥) in Shift_JIS.  That’s one approach.



Another approach is to map the logical character to its equivalent encoding in Shift_JIS.  Backslash, in its double-byte (full-width) form, is 0x815F in Shift_JIS.  To do this across the board, the underlying library would need a pretty sophisticated mapping for each encoding.



So should the above match a yen character (0x5C) in the foo element?  Or should it match the backslash character (0x815F) in the foo element?
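
The two candidate interpretations can be compared with a short Python sketch.  It assumes Python's built-in shift_jis codec as a stand-in for whatever charset tables a DFDL processor would use, takes U+FF3C (FULLWIDTH REVERSE SOLIDUS) as the "logical" double-byte backslash, and uses a made-up byte string as the field content.

```python
import re

# Made-up field content: "A", the single byte 0x5C, "B",
# the double-byte sequence 0x81 0x5F, "C".
data = b"A\x5cB\x81\x5fC"

# Approach 1: carry the UTF-8 byte of the pattern character over unchanged.
raw = "\\".encode("utf-8")               # one byte: 0x5C

# Approach 2: look up a Shift_JIS encoding for the logical character.
# Assumption: U+FF3C is the character meant by "backslash" here.
mapped = "\uff3c".encode("shift_jis")    # two bytes: 0x81 0x5F

# Offsets in the data where each interpretation would match.
raw_hits = [m.start() for m in re.finditer(re.escape(raw), data)]
mapped_hits = [m.start() for m in re.finditer(re.escape(mapped), data)]

print(raw.hex(), raw_hits)        # 5c [1]
print(mapped.hex(), mapped_hits)  # 815f [3]
```

The same testPattern therefore selects different bytes in the data depending on which interpretation a processor adopts.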



More complicated example using multi-byte characters:



    <?xml version="1.0" encoding="UTF-8"?>
    ...
    <xs:element name="foo" type="xs:string" dfdl:encoding="Shift_JIS">
      <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:assert testKind="pattern"
                       testPattern="&#x305F;&#x306A;&#x304B;&#x3055;&#x3093;"/>
        </xs:appinfo>
      </xs:annotation>
    </xs:element>
    ...



The above characters in testPattern map to たなかさん in Unicode.



How should this match work?



a.      The code points can’t be used directly as Shift_JIS, as they don’t represent a valid Shift_JIS encoding.  The bytes of the UTF-8 encoding of that same string don’t represent a valid Shift_JIS encoding either.  In neither case can the bytes simply be treated as Shift_JIS, so this approach should cause a processing error.

b.      Mapping each character to its logical equivalent would work, but the underlying library would have to know how to map each character from one encoding to another, and sometimes a single character from one encoding can be mapped to more than one character in another encoding.  In this case, a single mapping does exist; the equivalent characters in Shift_JIS are encoded as

               82BD 82C8 82A9 82B3 82F1.
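
Options (a) and (b) can be checked with a small Python sketch for the string たなかさん.  It again assumes Python's built-in shift_jis codec as a stand-in for a processor's charset tables, and the field content is made up for illustration.

```python
import re

pattern = "\u305f\u306a\u304b\u3055\u3093"   # たなかさん as Unicode characters

# (a) Treat the schema-side UTF-8 bytes as if they were already Shift_JIS.
utf8_bytes = pattern.encode("utf-8")
try:
    utf8_bytes.decode("shift_jis")
    reinterpretation_ok = True
except UnicodeDecodeError:
    # The byte sequence is not well-formed Shift_JIS.
    reinterpretation_ok = False

# (b) Map each logical character to its Shift_JIS encoding.
sjis_bytes = pattern.encode("shift_jis")

# One way to realize (b): decode the data with dfdl:encoding first, then
# apply the pattern to the decoded characters.
field_content = bytes.fromhex("82bd82c882a982b382f1")   # made-up data
matched = re.fullmatch(pattern, field_content.decode("shift_jis")) is not None

print(reinterpretation_ok)   # False
print(sjis_bytes.hex())      # 82bd82c882a982b382f1
print(matched)               # True
```

Decoding the data before matching keeps the regular expression entirely in Unicode, so the pattern itself never has to be converted.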



Thoughts?



--

Jonathan W. Cranford <jcranford at mitre.org>

Senior Information Systems Engineer

The MITRE Corporation (http://www.mitre.org)



