[DFDL-WG] Fwd: proposal: DFDL needs additional function dfdl:characterCode
Steve Hanson
smh at uk.ibm.com
Tue May 24 06:59:41 EDT 2016
To be discussed on DFDL WG call today.
Regards
Steve Hanson
IBM Integration Bus, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: Steve Hanson/UK/IBM at IBMGB
Date: 10/05/2016 16:57
Subject: Fwd: [DFDL-WG] proposal: DFDL needs additional function
dfdl:characterCode
This is the thread that mentions dfdl:characterCode(string, pos): int, and
dfdl:character(charCodeInt): String proposed functions.
It also discusses the issue of XML-illegal characters in the XML
representation of the DFDL infoset.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
---------- Forwarded message ----------
From: Tim Kimber <KIMBERT at uk.ibm.com>
Date: Thu, Nov 1, 2012 at 12:00 PM
Subject: Re: [DFDL-WG] proposal: DFDL needs additional function
dfdl:characterCode
To: dfdl-wg at ogf.org
Correct - there is no way at all of using illegal characters in an XML
document. Not CDATA, not character entities. They simply must not appear
anywhere.
I agree with Steve that XML compatibility is not the only requirement for
a DFDL info set - we should not do anything to make it XML specific in a
way that harms its generality. To balance that, I also think that the DFDL
Working Group should be paying attention to the issues around XML
compatibility, given that DFDL is based on XML Schema and many potential
adopters of DFDL will want to know about XML compatibility.
Mike's proposal of mapping illegal characters into the Unicode Private Use
Area sounds like a reasonable approach for implementers to use.
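A minimal sketch of what such a Private Use Area mapping could look like. The thread does not fix a concrete mapping, so the base offset U+E000, the restriction to C0 control characters, and the names to_xml_safe and from_xml_safe are all assumptions made for illustration.

```python
# Assumed offset into the Private Use Area; the thread does not specify one.
PUA_BASE = 0xE000

# C0 controls that XML 1.0 forbids: everything below U+0020 except tab, LF, CR.
ILLEGAL_C0 = frozenset(range(0x20)) - {0x9, 0xA, 0xD}

# Translation tables keyed by code point, as str.translate expects.
TO_PUA = {cp: PUA_BASE + cp for cp in ILLEGAL_C0}
FROM_PUA = {pua: cp for cp, pua in TO_PUA.items()}


def to_xml_safe(text: str) -> str:
    """Replace XML-illegal C0 controls with their assumed PUA stand-ins."""
    return text.translate(TO_PUA)


def from_xml_safe(text: str) -> str:
    """Undo to_xml_safe, restoring the original control characters."""
    return text.translate(FROM_PUA)


original = "123\x00456\x01789"
safe = to_xml_safe(original)
print([hex(ord(ch)) for ch in safe if ord(ch) >= PUA_BASE])
print(from_xml_safe(safe) == original)
```

One caveat with any scheme of this kind: data that already contains the chosen Private Use Area code points would not round-trip, so an implementation would need a rule for that case.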
regards,
Tim Kimber, DFDL Team,
Hursley, UK
Internet: kimbert at uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: Suman Kalia <kalia at ca.ibm.com>,
Cc: dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org
Date: 01/11/2012 15:14
Subject: Re: [DFDL-WG] proposal: DFDL needs additional function
dfdl:characterCode
Sent by: dfdl-wg-bounces at ogf.org
Turns out the XML char entities are not an escape scheme for putting
illegal chars in. E.g. &#0; is illegal even expressed that way.
The char entities are essentially an internationalization hack so you can
enter and render any legal character using only a small charset.
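This is easy to check with an XML 1.0 parser. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name parses is made up for illustration. It tries a legal character reference, a reference to U+0001, and a raw U+0001 inside a CDATA section.

```python
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Return True if doc is accepted as well-formed XML, False otherwise."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


print(parses("<a>&#65;</a>"))             # reference to a legal character
print(parses("<a>&#1;</a>"))              # reference to U+0001
print(parses("<a><![CDATA[\x01]]></a>"))  # raw U+0001 inside CDATA
```

Only the first document is accepted; neither the character reference nor the CDATA section gets U+0001 past the parser.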
On Nov 1, 2012 8:48 AM, "Suman Kalia" <kalia at ca.ibm.com> wrote:
Shouldn't we be using entity references for XML syntactic characters found
in text/binary data while creating the infoset, and vice versa...
Suman Kalia
IBM Canada Lab
WMB Toolkit Architect and Development Lead
Tel: 905-413-3923 T/L 313-3923
Email: kalia at ca.ibm.com
For info on Message broker
http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.html
From: Steve Hanson <smh at uk.ibm.com>
To: Mike Beckerle <mbeckerle.dfdl at gmail.com>,
Cc: dfdl-wg at ogf.org
Date: 11/01/2012 07:57 AM
Subject: Re: [DFDL-WG] proposal: DFDL needs additional
function dfdl:characterCode
Sent by: dfdl-wg-bounces at ogf.org
From WG call minutes 2012-10-30:
"Beyond the scope of DFDL 1.0. Assumption for now is that infoset needs
post-processing."
Mike has observed that other software systems "map the illegal characters
to/from the Unicode Private Use Area."
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: dfdl-wg at ogf.org,
Date: 04/10/2012 23:39
Subject: [DFDL-WG] proposal: DFDL needs additional function
dfdl:characterCode
Sent by: dfdl-wg-bounces at ogf.org
An important use case for DFDL is converting legacy data to/from XML.
XML 1.0 disallows a bunch of string characters.
If the data contains those characters, then the question arises of what to
turn them into that both preserves information content, but also is legal
in XML so that you can convert the DFDL infoset into XML without violating
XML's constraints.
The natural thing to do is create an element containing the character code
of the illegal character, as an integer.
E.g., character code U+0001 would become an element such as
<ccode>1</ccode>.
This could be done using a hidden element that is a string, and the
element ccode above would have an inputValueCalc that converts the
offending character of that string into an integer.
But we need a function dfdl:characterCode(str, pos) : int
The arguments would be a string, and a position (base 1) within that
string, and the return result would be the character code of the character
in the string at that position. If pos is out of the bounds of the string
(i.e., is negative, 0, or too large), then a processing error would occur.
For unparsing the inverse function would also be needed:
dfdl:character(intArg) : string. This would return a string containing one
character whose codepoint is the intArg.
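To make the intended semantics concrete, here is a rough model of the two proposed functions. It is only a sketch of the behaviour described above, not DFDL expression language: the names character_code and character are stand-ins, and ValueError stands in for a DFDL processing error.

```python
def character_code(s: str, pos: int) -> int:
    """Model of the proposed dfdl:characterCode(str, pos); pos is 1-based."""
    if pos < 1 or pos > len(s):
        # The proposal calls for a processing error here;
        # ValueError is used as a stand-in.
        raise ValueError(f"position {pos} is outside 1..{len(s)}")
    return ord(s[pos - 1])


def character(code: int) -> str:
    """Model of the proposed dfdl:character(intArg): a one-character string."""
    return chr(code)


# The fourth character of this string is the embedded NUL.
print(character_code("123\x00456", 4))
# The two functions are inverses for a single character.
print(character(character_code("\x01", 1)) == "\x01")
```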
Example
Consider this data:
123<0>456<1>789<2>123
where <0> means just one character with codepoint 0, etc.
In hex that would be 313233 00 343536 01 373839 02 313233
The best I can think of for modeling this while preserving all information
would end up with XML looking like this:
<nonXMLString>
<fragment><stringData>123</stringData></fragment>
<fragment><nonXMLChar><charCode>0</charCode></nonXMLChar></fragment>
<fragment><stringData>456</stringData></fragment>
<fragment><nonXMLChar><charCode>1</charCode></nonXMLChar></fragment>
<fragment><stringData>789</stringData></fragment>
<fragment><nonXMLChar><charCode>2</charCode></nonXMLChar></fragment>
<fragment><stringData>123</stringData></fragment>
</nonXMLString>
So our nonXMLString is of a type which is array of fragment, a fragment is
a choice of either (legal XML) stringData, or a nonXMLChar.
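The fragment structure can be sketched outside DFDL as a plain split of the string into runs of XML-legal characters and individual illegal characters. This is an illustration only: the helper names (is_xml10_char, to_fragments, to_xml, from_fragments) are hypothetical, and the legality test follows the Char production of XML 1.0.

```python
from itertools import groupby
from xml.sax.saxutils import escape


def is_xml10_char(ch: str) -> bool:
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def to_fragments(text: str) -> list:
    """Split text into ("stringData", str) and ("nonXMLChar", int) pairs."""
    fragments = []
    for legal, run in groupby(text, key=is_xml10_char):
        if legal:
            fragments.append(("stringData", "".join(run)))
        else:
            # One fragment per illegal character, carrying its code point.
            fragments.extend(("nonXMLChar", ord(ch)) for ch in run)
    return fragments


def to_xml(fragments: list) -> str:
    """Render the fragments using the element names from the example."""
    lines = ["<nonXMLString>"]
    for kind, value in fragments:
        if kind == "stringData":
            body = f"<stringData>{escape(value)}</stringData>"
        else:
            body = f"<nonXMLChar><charCode>{value}</charCode></nonXMLChar>"
        lines.append(f"<fragment>{body}</fragment>")
    lines.append("</nonXMLString>")
    return "\n".join(lines)


def from_fragments(fragments: list) -> str:
    """Inverse of to_fragments, as would be needed for unparsing."""
    return "".join(v if k == "stringData" else chr(v) for k, v in fragments)


data = "123\x00456\x01789\x02123"
print(to_xml(to_fragments(data)))
print(from_fragments(to_fragments(data)) == data)
```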
The nonXMLChar has a child element because it needs to convert to/from a
string, using inputValueCalc and outputValueCalc to do so. It therefore
needs to be a sequence so that it can hold the other hidden elements
needed to pull this off.
stringData would have lengthKind="pattern" and a pattern that allows any
sequence of XML-allowed characters.
nonXMLChar would have a hidden first child element of type string of
explicit length 1 with an assertion that the string match a pattern that
is any of the illegal characters (but just one of them). The charCode
child element would inputValueCalc to get the character code of the
character. For 8 bit encodings it would be ok as a table lookup in XPath,
but for unicode..... we'd need a function that returns a character code.
If you just have one embedded illegal character, like NUL, then you could
just model it as a separator, which would simplify things considerably
(and is possible in a someday XML 1.1 future since NUL is then the only
disallowed character.)
But for XML 1.0's illegal characters, we need to be able to convert
to/from some non-string representation if we are to preserve information
content. Hence we need these additional functions.
--
Mike Beckerle | OGF DFDL WG Co-Chair
Tel: 781-330-0412
--
dfdl-wg mailing list
dfdl-wg at ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU