[DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode

Suman Kalia kalia at ca.ibm.com
Thu Nov 1 11:13:54 EDT 2012


Agree. XML is an important use case. We certainly want to provide 
guidance to the user on how to do this. If we can standardize, it would 
be great. 

Suman Kalia
IBM Canada Lab
WMB Toolkit Architect and Development Lead
Tel: 905-413-3923 T/L 313-3923
Email: kalia at ca.ibm.com

For info on Message broker
http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.html





From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     Suman Kalia/Toronto/IBM at IBMCA, 
Cc:     Steve Hanson <smh at uk.ibm.com>, dfdl-wg at ogf.org
Date:   11/01/2012 11:08 AM
Subject:        Re: [DFDL-WG] proposal: DFDL needs additional function 
dfdl:characterCode



I think all implementations will have to solve this problem. XML 
interchange is an important use case.

The question is just whether the DFDL standard says exactly how to do it 
or we leave it up to implementations and standardize an approach later.

...mike

On Thu, Nov 1, 2012 at 9:25 AM, Suman Kalia <kalia at ca.ibm.com> wrote:
I have an xsd element of type string, and the text data pertaining to it 
is as in Mike's example. What I infer from your note is that in the DFDL 
infoset the string will appear as is, i.e. 123<0>456<1>789<2>123. 
It is up to the user to parse this string and handle syntactic characters 
if he wants to render this in XML? 

  






From:        Steve Hanson <smh at uk.ibm.com> 
To:        Suman Kalia/Toronto/IBM at IBMCA, 
Cc:        dfdl-wg at ogf.org 
Date:        11/01/2012 09:07 AM 
Subject:        Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode 




If you use XML-specific entity references then you have forced all 
consumers of the DFDL Infoset to be XML aware. The DFDL infoset is 
(deliberately) not an XML infoset. If I am parsing a string that contains 
x'08' there is nothing intrinsically wrong with that code point. It's only 
a problem if it is subsequently serialised as XML.  (If we wanted to have 
an XML focus to the DFDL infoset then we would have gone down the XDM 
route, an approach which was rejected). 
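
To illustrate the point, here is a minimal Python sketch (not part of DFDL; 
the helper name is_well_formed is made up for this example). The string 
holding x'08' is a perfectly good in-memory value. It only fails once it is 
placed in an XML 1.0 document, and a numeric character reference such as 
&#8; does not help, because XML 1.0 rejects that as well.

```python
import xml.etree.ElementTree as ET

# A string containing code point 0x08: nothing wrong with it as a value.
value = "AB\x08CD"


def is_well_formed(doc: str) -> bool:
    """Return True if doc parses as XML 1.0 (hypothetical helper)."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# The raw character embedded directly in element content.
raw_ok = is_well_formed(f"<s>{value}</s>")

# The same character written as a numeric character reference.
ref_ok = is_well_formed("<s>AB&#8;CD</s>")

# A string without the control character, for comparison.
plain_ok = is_well_formed("<s>ABCD</s>")
```

Both raw_ok and ref_ok come out False with the expat-based parser, while 
plain_ok is True.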

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 



From:        Suman Kalia <kalia at ca.ibm.com> 
To:        Steve Hanson/UK/IBM at IBMGB, 
Cc:        dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org, Mike Beckerle <mbeckerle.dfdl at gmail.com> 
Date:        01/11/2012 12:47 
Subject:        Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode 



Shouldn't we be using entity references for XML syntactic characters found 
in text/binary data while creating the infoset, and vice versa? 







From:        Steve Hanson <smh at uk.ibm.com> 
To:        Mike Beckerle <mbeckerle.dfdl at gmail.com>, 
Cc:        dfdl-wg at ogf.org 
Date:        11/01/2012 07:57 AM 
Subject:        Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode 
Sent by:        dfdl-wg-bounces at ogf.org 



From WG call minutes 2012-10-30: 

"Beyond the scope of DFDL 1.0.  Assumption for now is that infoset needs 
post-processing." 

Mike has observed that other software systems  "map the illegal characters 
to/from the Unicode Private Use Area." 
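
A minimal Python sketch of what such a Private Use Area mapping could look 
like. The base offset U+E000 and the function names are assumptions made 
for illustration; the systems referred to above may use a different range. 
Only the C0 control characters that XML 1.0 forbids are handled here.

```python
# Assumption: map onto the start of the BMP Private Use Area (U+E000..).
PUA_BASE = 0xE000


def _is_illegal(cp: int) -> bool:
    """C0 controls other than tab, LF and CR are not allowed in XML 1.0."""
    return cp < 0x20 and cp not in (0x09, 0x0A, 0x0D)


def to_pua(s: str) -> str:
    """Replace each illegal control character with a PUA stand-in."""
    return "".join(
        chr(PUA_BASE + ord(c)) if _is_illegal(ord(c)) else c for c in s
    )


def from_pua(s: str) -> str:
    """Inverse of to_pua: turn PUA stand-ins back into control characters."""
    def restore(c: str) -> str:
        offset = ord(c) - PUA_BASE
        return chr(offset) if 0 <= offset < 0x20 and _is_illegal(offset) else c

    return "".join(restore(c) for c in s)


original = "123\x00456\x01789\x02123"
mapped = to_pua(original)
restored = from_pua(mapped)
```

The round trip is only lossless if the original data never contains the 
chosen Private Use code points itself, which is one reason a standardized 
approach would need care.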

Regards




From:        Mike Beckerle <mbeckerle.dfdl at gmail.com> 
To:        dfdl-wg at ogf.org, 
Date:        04/10/2012 23:39 
Subject:        [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode 
Sent by:        dfdl-wg-bounces at ogf.org 




An important use case for DFDL is converting legacy data to/from XML.

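XML 1.0 disallows a number of characters in strings, including most control 
characters below U+0020 (all except tab, line feed, and carriage return).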

If the data contains those characters, then the question arises of what to 
turn them into that both preserves information content and is legal in 
XML, so that you can convert the DFDL infoset into XML without violating 
XML's constraints.

The natural thing to do is create an element containing the character code 
of the illegal character, as an integer.

E.g., character code U+0001 would become an element such as: 
 <ccode>1</ccode>

This could be done using a hidden element that is a string, and the 
element ccode above would have an inputValueCalc that converts the 
offending character of that string into an integer.

But we need a function dfdl:characterCode(str, pos) : int

The arguments would be a string, and a position (base 1) within that 
string, and the return result would be the character code of the character 
in the string at that position. If pos is out of the bounds of the string 
(i.e., is negative, 0, or too large), then a processing error would occur. 


For unparsing the inverse function would also be needed: 
dfdl:character(intArg) : string. This would return a string containing one 
character whose codepoint is the intArg.
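
The intended behaviour of the two proposed functions can be modelled in a 
few lines of Python. This is only a sketch of the semantics described 
above, not DFDL expression syntax. The Python names character_code and 
character are stand-ins, and ValueError stands in for a DFDL processing 
error.

```python
def character_code(s: str, pos: int) -> int:
    """Model of the proposed dfdl:characterCode(str, pos).

    pos is 1-based, as in the proposal.
    """
    if pos < 1 or pos > len(s):
        # Negative, zero, or too-large positions are a processing error.
        raise ValueError(
            f"position {pos} is out of bounds for a string of length {len(s)}"
        )
    return ord(s[pos - 1])


def character(code: int) -> str:
    """Model of the proposed dfdl:character(intArg).

    Returns a one-character string whose code point is code.
    """
    return chr(code)


sample = "123\x00456"
code_at_4 = character_code(sample, 4)   # the NUL after "123"
round_trip = character(code_at_4)
```
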

Example 

Consider this data:
 123<0>456<1>789<2>123
where <0> means just one character with codepoint 0, etc.

In hex that would be 313233 00 343536 01 373839 02 313233

The best I can think of for modeling this while preserving all information 
would end up with XML looking like this:

<nonXMLString>
<fragment><stringData>123</stringData></fragment> 
<fragment><nonXMLChar><charCode>0</charCode></nonXMLChar></fragment>
<fragment><stringData>456</stringData></fragment>
<fragment><nonXMLChar><charCode>1</charCode></nonXMLChar></fragment>
<fragment><stringData>789</stringData></fragment>
<fragment><nonXMLChar><charCode>2</charCode></nonXMLChar></fragment>
<fragment><stringData>123</stringData></fragment>
</nonXMLString>

So our nonXMLString is of a type which is array of fragment, a fragment is 
a choice of either (legal XML) stringData, or a nonXMLChar. 
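
As a rough illustration of the target structure (plain Python, not a DFDL 
schema), the sketch below splits the example data into fragments and 
prints the XML shown above. The function name to_fragments and the ILLEGAL 
pattern are assumptions for this example. The pattern covers only the 
control characters plus U+FFFE and U+FFFF, and leaves surrogates out for 
brevity.

```python
import re
from xml.sax.saxutils import escape

# Code points that XML 1.0 forbids (simplified: surrogates are omitted).
ILLEGAL = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]")


def to_fragments(data: str) -> str:
    """Render data as the nonXMLString/fragment structure of the example."""
    def string_data(text: str) -> str:
        return f"<fragment><stringData>{escape(text)}</stringData></fragment>"

    lines = ["<nonXMLString>"]
    pos = 0
    for m in ILLEGAL.finditer(data):
        if m.start() > pos:
            # Run of legal characters before the illegal one.
            lines.append(string_data(data[pos:m.start()]))
        code = ord(m.group())
        lines.append(
            f"<fragment><nonXMLChar><charCode>{code}</charCode>"
            "</nonXMLChar></fragment>"
        )
        pos = m.end()
    if pos < len(data):
        # Trailing run of legal characters.
        lines.append(string_data(data[pos:]))
    lines.append("</nonXMLString>")
    return "\n".join(lines)


# The example bytes from the hex dump, decoded as ASCII.
data = bytes.fromhex("313233 00 343536 01 373839 02 313233").decode("ascii")
xml_text = to_fragments(data)
print(xml_text)
```
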

The nonXMLChar has a child element because it will need to convert to/from 
a string, so it will use inputValueCalc and outputValueCalc to do so. It 
therefore needs to be a sequence so that it can have the other hidden 
elements needed to pull this off.

stringData would have lengthKind="pattern" and a pattern that allows any 
sequence of XML-allowed characters. 
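
For reference, the set of XML-allowed characters is the Char production of 
XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
[#x10000-#x10FFFF]. The sketch below writes that set as a Python regular 
expression. A real schema would express it in the DFDL regular expression 
syntax instead, and the helper name leading_xml_run is hypothetical.

```python
import re

# One or more characters from the XML 1.0 Char production.
XML_CHARS = re.compile(
    r"[\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+"
)


def leading_xml_run(s: str) -> str:
    """Return the longest prefix of s made only of XML-allowed characters."""
    m = XML_CHARS.match(s)
    return m.group() if m else ""


first = leading_xml_run("123\x00456")   # stops at the NUL
none = leading_xml_run("\x01abc")       # starts with an illegal character
whole = leading_xml_run("a\tb\nc")      # tab and newline are allowed
```
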

nonXMLChar would have a hidden first child element of type string of 
explicit length 1 with an assertion that the string match a pattern that 
is any of the illegal characters (but just one of them). The charCode 
child element would use inputValueCalc to get the character code of the 
character. For 8-bit encodings a table lookup in XPath would be OK, 
but for Unicode we'd need a function that returns a character code. 

If you just have one embedded illegal character, like NUL, then you could 
just model it as a separator, which would simplify things considerably 
(and is possible in a someday XML 1.1 future since NUL is then the only 
disallowed character.)

But for XML 1.0's illegal characters, we need to be able to convert 
to/from some non-string representation if we are to preserve information 
content. Hence we need these additional functions. 

-- 
Mike Beckerle | OGF DFDL WG Co-Chair 
Tel:  781-330-0412
--
dfdl-wg mailing list
dfdl-wg at ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg 

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU



