[DFDL-WG] Actions 290 & 291 - variable width characters and utf16Width 'variable'

Mike Beckerle mbeckerle.dfdl at gmail.com
Tue Nov 8 14:27:59 EST 2016


I verified that Java represents Unicode code points above U+FFFF as

(a) an int, when the character code is handled outside of a String.
(b) a pair of surrogate chars (UTF-16 code units), when held in a Java String.

There are now a variety of methods that take or return an int code point,
which can be as large as U+10FFFF, and which read or write either 1 or 2
chars of a String.
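
As a rough illustration of the int-based methods mentioned above, here is a
minimal sketch using only standard java.lang calls. The class name and the
sample character U+1F600 are my own illustrative choices, not anything taken
from the DFDL spec or an implementation.

```java
// Sketch only: class name and sample code point are illustrative assumptions.
final class CodePointApiDemo {
    // U+1F600 lies above U+FFFF, so it needs a surrogate pair in a String.
    static final int SAMPLE = 0x1F600;

    // Build a String from an int code point (yields 1 or 2 chars).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(SAMPLE);
        // Number of chars, i.e. UTF-16 code units: 2 for this sample.
        System.out.println(s.length());
        // Number of Unicode code points: 1 for this sample.
        System.out.println(s.codePointCount(0, s.length()));
        // Reads both chars of the pair and returns the int code point.
        System.out.println(Integer.toHexString(s.codePointAt(0)));
        // The two chars are a high and a low surrogate, in that order.
        System.out.println(Character.isHighSurrogate(s.charAt(0)));
        System.out.println(Character.isLowSurrogate(s.charAt(1)));
    }
}
```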

For utf-16: A character requiring more than 16 bits is represented as the 2
code units of a surrogate pair, and each of those becomes a Java char.

When utf16Width is 'variable', this surrogate pair counts as 1 Unicode
character when a length is measured in 'characters'. This is the feature I
believe should be optional in DFDL.
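
To make the counting difference concrete, here is a small sketch. The helper
lengthInCharacters and its boolean flag are hypothetical names of mine, not
part of any DFDL implementation. It assumes that, when utf16Width is not
'variable', each 16-bit code unit is counted as one character.

```java
// Sketch only: hypothetical helper, not taken from any DFDL implementation.
final class Utf16WidthDemo {
    // variableWidth == true  : a surrogate pair counts as 1 character.
    // variableWidth == false : each 16-bit code unit counts as 1 character
    //                          (assumed behavior for the non-'variable' case).
    static int lengthInCharacters(String s, boolean variableWidth) {
        return variableWidth ? s.codePointCount(0, s.length()) : s.length();
    }

    public static void main(String[] args) {
        // "a", then U+1F600 as a surrogate pair, then "b".
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(lengthInCharacters(s, false)); // counts code units
        System.out.println(lengthInCharacters(s, true));  // counts code points
    }
}
```

For the sample string the two counts differ by one (4 code units versus 3
code points), which is exactly the discrepancy a length measured in
'characters' has to resolve.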

For utf-8: A character requiring more than 16 bits occupies 4 bytes in the
data, but once decoded into a Java String it is also represented as the 2
code units of a surrogate pair.

However, we have no property for indicating that this surrogate pair counts
as 1 Unicode character when a length is measured in 'characters'.

For utf-32: Same issue. A single code point in utf-32 may have to be
represented as a surrogate pair in a Java String.

However, we again have no property for indicating that this surrogate pair
counts as 1 Unicode character.
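
A short sketch of the utf-8 and utf-32 cases: the same code point, U+1F600,
arrives as 4 bytes in either encoding and ends up as 2 chars in the Java
String. The class and method names are illustrative assumptions. The utf-32
decoder is hand-written here (big-endian, well-formed input only) so that
the example does not depend on the JVM shipping a UTF-32 charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch only: names are illustrative, input is assumed to be well formed.
final class DecodeToSurrogatesDemo {
    // UTF-8: the standard decoder produces a surrogate pair for 4-byte sequences.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // UTF-32BE: read one 32-bit code point at a time and append it.
    // appendCodePoint emits 2 chars for any code point above U+FFFF.
    static String decodeUtf32be(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes); // big-endian by default
        StringBuilder sb = new StringBuilder();
        while (buf.remaining() >= 4) {
            sb.appendCodePoint(buf.getInt());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};
        byte[] utf32 = {0x00, 0x01, (byte) 0xF6, 0x00};
        String a = decodeUtf8(utf8);
        String b = decodeUtf32be(utf32);
        System.out.println(a.length());  // chars in the String
        System.out.println(b.length());  // chars in the String
        System.out.println(a.equals(b)); // same String either way
    }
}
```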

...mikeb


Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>