[DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode

Thu Oct 4 18:18:45 EDT 2012

An important use case for DFDL is converting legacy data to/from XML.

XML 1.0 disallows a number of characters in strings, including most control
characters below U+0020 (all but tab, line feed, and carriage return).

If the data contains those characters, the question arises of what to turn
them into so that the information content is preserved and the result is
legal in XML, so that you can convert the DFDL infoset into XML without
violating XML's constraints.

The natural thing to do is create an element containing the character code
of the illegal character, as an integer.

E.g., the character U+0001 would become an element such as:
<ccode>1</ccode>

This could be done using a hidden element that is a string, and the element
ccode above would have an inputValueCalc that converts the offending
character of that string into an integer.

But to do that we need a function dfdl:characterCode(str, pos) : int

The arguments would be a string, and a position (base 1) within that
string, and the return result would be the character code of the character
in the string at that position. If pos is out of the bounds of the string
(i.e., is negative, 0, or too large), then a processing error would occur.
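
To make the proposed semantics concrete, here is a minimal Python sketch of
what dfdl:characterCode might do. The names character_code and
ProcessingError are hypothetical stand-ins, not part of DFDL, and the sketch
assumes positions count Unicode code points.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


def character_code(s: str, pos: int) -> int:
    """Return the code point of the character at 1-based position pos in s."""
    # Negative, zero, or too-large positions are all rejected.
    if pos < 1 or pos > len(s):
        raise ProcessingError(f"position {pos} is outside 1..{len(s)}")
    return ord(s[pos - 1])


print(character_code("\x01", 1))       # prints 1
print(character_code("A\u20ac", 2))    # prints 8364 (U+20AC)
```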

For unparsing the inverse function would also be needed:
dfdl:character(intArg) : string. This would return a string containing one
character whose codepoint is the intArg.
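
The inverse can be sketched the same way. The name character is taken from
the proposal, but the range check and the choice of ValueError are
assumptions made for this illustration only.

```python
def character(code: int) -> str:
    """Return a one-character string whose code point is code."""
    # Assumption: anything outside the Unicode code space is an error.
    if not 0 <= code <= 0x10FFFF:
        raise ValueError(f"{code} is not a valid code point")
    return chr(code)


s = character(1)
print(len(s), ord(s))   # prints: 1 1
```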

Example

Consider this data:
    123<0>456<1>789<2>123
where <0> means just one character with codepoint 0, etc.

In hex that would be 313233 00 343536 01 373839 02 313233

The best I can think of for modeling this while preserving all information
would end up with XML looking like this:

<nonXMLString>
<fragment><stringData>123</stringData></fragment>
<fragment><nonXMLChar><charCode>0</charCode></nonXMLChar></fragment>
<fragment><stringData>456</stringData></fragment>
<fragment><nonXMLChar><charCode>1</charCode></nonXMLChar></fragment>
<fragment><stringData>789</stringData></fragment>
<fragment><nonXMLChar><charCode>2</charCode></nonXMLChar></fragment>
<fragment><stringData>123</stringData></fragment>
</nonXMLString>

So our nonXMLString is of a type which is an array of fragment, where a
fragment is a choice of either (legal XML) stringData or a nonXMLChar.
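
As an illustration of the intended parse result (not of how a DFDL processor
would produce it), the following Python sketch splits the example data into
the same fragments. The helper names is_xml10_char, to_fragments, and to_xml
are hypothetical; the character ranges are those of the XML 1.0 Char
production.

```python
from itertools import groupby
from xml.sax.saxutils import escape


def is_xml10_char(ch: str) -> bool:
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def to_fragments(data: str):
    """Split data into ('stringData', str) and ('charCode', int) fragments."""
    fragments = []
    for legal, run in groupby(data, key=is_xml10_char):
        if legal:
            fragments.append(("stringData", "".join(run)))
        else:
            # One fragment per illegal character, as in the example above.
            fragments.extend(("charCode", ord(ch)) for ch in run)
    return fragments


def to_xml(data: str) -> str:
    """Render the fragments in the nonXMLString layout."""
    lines = ["<nonXMLString>"]
    for kind, value in to_fragments(data):
        if kind == "stringData":
            body = f"<stringData>{escape(value)}</stringData>"
        else:
            body = f"<nonXMLChar><charCode>{value}</charCode></nonXMLChar>"
        lines.append(f"<fragment>{body}</fragment>")
    lines.append("</nonXMLString>")
    return "\n".join(lines)


data = bytes.fromhex("313233 00 343536 01 373839 02 313233").decode("ascii")
print(to_xml(data))
```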

The nonXMLChar has a child element because it needs to convert to and from
a string, using inputValueCalc and outputValueCalc to do so. It therefore
needs to be a sequence, so that it can hold the other hidden elements needed
to pull this off.

stringData would have lengthKind="pattern" and a pattern that allows any
sequence of XML-allowed characters.
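
One possible shape for such a pattern, sketched here with Python's re module
rather than as a DFDL property value, is a single character class covering
the XML 1.0 Char ranges. The name XML10_RUN is hypothetical, and the exact
regex syntax a DFDL processor accepts may differ.

```python
import re

# One or more characters allowed by the XML 1.0 Char production.
XML10_RUN = re.compile(
    r"[\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+"
)

# Matching at the current position consumes the longest legal prefix.
m = XML10_RUN.match("123\x00456")
print(m.group())                        # prints 123

# Data that starts with an illegal character yields no match at all.
print(XML10_RUN.match("\x01abc"))       # prints None
```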

nonXMLChar would have a hidden first child element of type string with
explicit length 1, with an assertion that the string matches a pattern
accepting any one of the illegal characters (but just one of them). The
charCode child element would use inputValueCalc to get the character code
of that character. For 8-bit encodings a table lookup in XPath would be OK,
but for Unicode we'd need a function that returns a character code.

If you have just one embedded illegal character, like NUL, then you could
just model it as a separator, which would simplify things considerably (and
may be all that is needed in some future XML 1.1 world, since NUL is then
the only disallowed control character).

But for XML 1.0's illegal characters, we need to be able to convert to/from
some non-string representation if we are to preserve information content.
Hence we need these additional functions.

-- 
Mike Beckerle | OGF DFDL WG Co-Chair
Tel:  781-330-0412