[DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode

Mike Beckerle mbeckerle.dfdl at gmail.com
Thu Nov 1 11:01:49 EDT 2012


I think all implementations will have to solve this problem. XML
interchange is an important use case.

The question is just whether the DFDL *standard* says exactly how to do it
or we leave it up to implementations and standardize an approach later.

...mike

On Thu, Nov 1, 2012 at 9:25 AM, Suman Kalia <kalia at ca.ibm.com> wrote:

> I have an xsd element of type string, and the text data for it is as in
> Mike's example. What I infer from your note is that in the DFDL infoset
> the string will appear as is, i.e. 123<0>456<1>789<2>123. It is then up
> to the user to parse this string and handle the syntactic characters if
> they want to render it in XML?
>
>
> Suman Kalia
> IBM Canada Lab
> WMB Toolkit Architect and Development Lead
> Tel: 905-413-3923 T/L 313-3923
> Email: kalia at ca.ibm.com
>
> For info on Message broker
>
> http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.html
>
>
>
>
>
> From:        Steve Hanson <smh at uk.ibm.com>
> To:        Suman Kalia/Toronto/IBM at IBMCA,
> Cc:        dfdl-wg at ogf.org
> Date:        11/01/2012 09:07 AM
> Subject:        Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode
> ------------------------------
>
>
>
> If you use XML-specific entity references then you have forced all
> consumers of the DFDL Infoset to be XML aware. The DFDL infoset is
> (deliberately) not an XML infoset. If I am parsing a string that contains
> x'08' there is nothing intrinsically wrong with that code point. It's only
> a problem if it is subsequently serialised as XML.  (If we wanted to have
> an XML focus to the DFDL infoset then we would have gone down the XDM
> route, an approach which was rejected).
>
> Regards
>
> Steve Hanson
> Architect, Data Format Description Language (DFDL)
> Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> IBM SWG, Hursley, UK
> smh at uk.ibm.com
> tel:+44-1962-815848
>
>
>
> From:        Suman Kalia <kalia at ca.ibm.com>
> To:        Steve Hanson/UK/IBM at IBMGB,
> Cc:        dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org, Mike Beckerle <
> mbeckerle.dfdl at gmail.com>
> Date:        01/11/2012 12:47
> Subject:        Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode
>  ------------------------------
>
>
>
> Shouldn't we be using entity references for XML syntactic characters found
> in text/binary data when creating the infoset, and vice versa?
>
> Suman Kalia
> IBM Canada Lab
> WMB Toolkit Architect and Development Lead
> Tel: 905-413-3923 T/L 313-3923
> Email: kalia at ca.ibm.com
>
> For info on Message broker
>
> http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.html
>
>
>
>
>
> From:        Steve Hanson <smh at uk.ibm.com>
> To:        Mike Beckerle <mbeckerle.dfdl at gmail.com>,
> Cc:        dfdl-wg at ogf.org
> Date:        11/01/2012 07:57 AM
> Subject:        Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode
> Sent by:        dfdl-wg-bounces at ogf.org
>  ------------------------------
>
>
>
> From WG call minutes 2012-10-30:
>
> "Beyond the scope of DFDL 1.0.  Assumption for now is that infoset needs
> post-processing."
>
> Mike has observed that other software systems  "map the illegal
> characters to/from the Unicode Private Use Area."
>
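A minimal sketch of the Private Use Area idea quoted above, in Python. The minutes do not say which PUA range or offset those other systems use, so the base U+E000 and the names PUA_BASE, to_pua, and from_pua are assumptions made for illustration. Only the C0 control characters that XML 1.0 disallows are handled here.

```python
# Assumed offset into the Unicode Private Use Area (hypothetical choice).
PUA_BASE = 0xE000


def _is_illegal_c0(cp: int) -> bool:
    """True for C0 controls that XML 1.0 disallows (all except TAB, LF, CR)."""
    return 0 <= cp < 0x20 and cp not in (0x09, 0x0A, 0x0D)


def to_pua(s: str) -> str:
    """Replace each XML-illegal C0 control with PUA_BASE + its code point."""
    return "".join(
        chr(PUA_BASE + ord(c)) if _is_illegal_c0(ord(c)) else c for c in s
    )


def from_pua(s: str) -> str:
    """Inverse of to_pua: map remapped PUA characters back to C0 controls."""
    return "".join(
        chr(ord(c) - PUA_BASE) if _is_illegal_c0(ord(c) - PUA_BASE) else c
        for c in s
    )


original = "123\x00456\x01789\x02123"
remapped = to_pua(original)
restored = from_pua(remapped)
```

Data that already contains U+E000 through U+E01F would not round-trip under this scheme, which is one reason a real mapping would have to be chosen with care.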
> Regards
>
> Steve Hanson
> Architect, Data Format Description Language (DFDL)
> Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> IBM SWG, Hursley, UK
> smh at uk.ibm.com
> tel:+44-1962-815848
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        dfdl-wg at ogf.org,
> Date:        04/10/2012 23:39
> Subject:        [DFDL-WG] proposal: DFDL needs additional function
>  dfdl:characterCode
> Sent by:        dfdl-wg-bounces at ogf.org
>  ------------------------------
>
>
>
>
> An important use case for DFDL is converting legacy data to/from XML.
>
> XML 1.0 disallows a number of characters, including most control
> characters below U+0020 (all except tab, line feed, and carriage return).
>
> If the data contains those characters, the question arises of what to
> turn them into so that the information content is preserved and the result
> is legal in XML, allowing the DFDL infoset to be converted into XML without
> violating XML's constraints.
>
> The natural thing to do is create an element containing the character code
> of the illegal character, as an integer.
>
> E.g., character code U+0001 would become an element such as
>  <ccode>1</ccode>.
>
> This could be done using a hidden element that is a string, and the
> element ccode above would have an inputValueCalc that converts the
> offending character of that string into an integer.
>
> But we need a function dfdl:characterCode(str, pos) : int
>
> The arguments would be a string, and a position (base 1) within that
> string, and the return result would be the character code of the character
> in the string at that position. If pos is out of the bounds of the string
> (i.e., is negative, 0, or too large), then a processing error would occur.
>
> For unparsing the inverse function would also be needed:
> dfdl:character(intArg) : string. This would return a string containing one
> character whose codepoint is the intArg.
>
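A minimal sketch, in Python, of the semantics described for the two proposed functions. The names character_code and character are hypothetical stand-ins for dfdl:characterCode and dfdl:character, and a Python ValueError stands in for a DFDL processing error; none of this is an actual DFDL implementation.

```python
def character_code(s: str, pos: int) -> int:
    """Stand-in for the proposed dfdl:characterCode(str, pos).

    pos is 1-based, as in the proposal. A position that is negative,
    0, or past the end of the string is treated as an error.
    """
    if pos < 1 or pos > len(s):
        raise ValueError(
            f"position {pos} is out of bounds for a string of length {len(s)}"
        )
    return ord(s[pos - 1])


def character(code: int) -> str:
    """Stand-in for the proposed dfdl:character(intArg).

    Returns a one-character string whose code point is `code`.
    """
    return chr(code)


# The fourth character of this string is the NUL from the example data.
nul_code = character_code("123\x00456", 4)
soh_char = character(1)
```

Python strings are indexed by code point, so the same sketch also covers characters outside the Basic Multilingual Plane.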
> Example
>
> Consider this data:
>  123<0>456<1>789<2>123
> where <0> means just one character with codepoint 0, etc.
>
> In hex that would be 313233 00 343536 01 373839 02 313233
>
> The best I can think of for modeling this while preserving all information
> would end up with XML looking like this:
>
> <nonXMLString>
> <fragment><stringData>123</stringData></fragment>
> <fragment><nonXMLChar><charCode>0</charCode></nonXMLChar></fragment>
> <fragment><stringData>456</stringData></fragment>
> <fragment><nonXMLChar><charCode>1</charCode></nonXMLChar></fragment>
> <fragment><stringData>789</stringData></fragment>
> <fragment><nonXMLChar><charCode>2</charCode></nonXMLChar></fragment>
> <fragment><stringData>123</stringData></fragment>
> </nonXMLString>
>
> So our nonXMLString is of a type which is an array of fragment, where a
> fragment is a choice of either (legal XML) stringData or a nonXMLChar.
>
> The nonXMLChar has a child element because it needs to convert to and from
> a string, using inputValueCalc and outputValueCalc to do so. It therefore
> needs to be a sequence, so that it can hold the hidden elements needed to
> pull this off.
>
> stringData would have lengthKind="pattern" and a pattern that allows any
> sequence of XML-allowed characters.
>
> nonXMLChar would have a hidden first child element of type string with
> explicit length 1 and an assertion that the string matches a pattern
> accepting exactly one of the illegal characters. The charCode child
> element would use inputValueCalc to get the character code of that
> character. For 8-bit encodings a table lookup in XPath would be OK, but
> for Unicode we'd need a function that returns a character code.
>
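The fragment model described above can be sketched outside DFDL. The Python below is only an illustration of the splitting and rejoining logic, not a DFDL schema; the names to_fragments and from_fragments are hypothetical, and the character class follows the Char production of XML 1.0.

```python
import re

# Characters allowed by the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_LEGAL = r"\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF"

# Group 1: a run of legal characters (stringData).
# Group 2: exactly one illegal character (nonXMLChar).
_FRAGMENT = re.compile(f"([{_XML_LEGAL}]+)|([^{_XML_LEGAL}])")


def to_fragments(s: str) -> list:
    """Split s into ("stringData", text) and ("charCode", int) fragments."""
    fragments = []
    for m in _FRAGMENT.finditer(s):
        if m.group(1) is not None:
            fragments.append(("stringData", m.group(1)))
        else:
            fragments.append(("charCode", ord(m.group(2))))
    return fragments


def from_fragments(fragments: list) -> str:
    """Rebuild the original string from the fragments (the unparse direction)."""
    return "".join(
        value if kind == "stringData" else chr(value)
        for kind, value in fragments
    )


data = "123\x00456\x01789\x02123"
fragments = to_fragments(data)
rebuilt = from_fragments(fragments)
```

Applied to the example data, this yields seven fragments that alternate between stringData and charCode, mirroring the XML shown earlier, and rejoining them gives back the original string.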
> If you just have one embedded illegal character, like NUL, then you could
> just model it as a separator, which would simplify things considerably (and
> would suffice in some future where XML 1.1 is used, since NUL is then the
> only disallowed character).
>
> But for XML 1.0's illegal characters, we need to be able to convert
> to/from some non-string representation if we are to preserve information
> content. Hence we need these additional functions.
>
> --
> Mike Beckerle | OGF DFDL WG Co-Chair
> Tel:  781-330-0412
> --
> dfdl-wg mailing list
> dfdl-wg at ogf.org
> https://www.ogf.org/mailman/listinfo/dfdl-wg
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>



-- 
Mike Beckerle | OGF DFDL WG Co-Chair
Tel:  781-330-0412