[DFDL-WG] dfdl-wg Digest, Vol 40, Issue 17 : Action 072 resolved
Tim Kimber
KIMBERT at uk.ibm.com
Fri Dec 18 05:29:19 CST 2009
Action 072 (TK): Byte Order Mark and Unicode signature
16/12: Investigate whether the spec's position on UTF-16/32 BOM is
implementable
The implementation team have carried out tests on the Java and C
implementations of ICU. The results are:
Java ICU libraries:

Encoding     Input       BOM included in decoded string?
UTF-8        <BOM>AAA    yes
UTF-16       <BOM>AAA    yes
UTF-16-LE    <BOM>AAA    yes
UTF-16-BE    <BOM>AAA    yes
UTF-32       <BOM>AAA    no
UTF-32-LE    <BOM>AAA    no
UTF-32-BE    <BOM>AAA    no

C ICU libraries:

Encoding     Input       BOM included in decoded string?
UTF-8        <BOM>AAA    yes
UTF-16       <BOM>AAA    yes
UTF-16-LE    <BOM>AAA    yes
UTF-16-BE    <BOM>AAA    yes
UTF-32       <BOM>AAA    yes
UTF-32-LE    <BOM>AAA    yes
UTF-32-BE    <BOM>AAA    yes

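I suspect that the UTF-32 anomaly in the Java results (the BOM is dropped
for UTF-32 but kept for every other encoding, and kept throughout by the C
libraries) is a defect in ICU. I tried to confirm this using Google, but I
didn't find any reference to it online.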
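
For anyone who wants to repeat this kind of check without an ICU build, here is a minimal sketch of the same test shape using Python's standard codecs. This is not the implementation team's harness and it does not exercise ICU; the sample byte strings, the little-endian BOM chosen for the plain UTF-16 and UTF-32 cases, and the helper names `bom_in_decoded` and `run_bom_tests` are all assumptions made for illustration.

```python
import codecs

# Hypothetical inputs: "<BOM>AAA" serialised with an explicit BOM.
# For the plain "utf-16" / "utf-32" cases a little-endian BOM is assumed.
SAMPLES = {
    "utf-8": codecs.BOM_UTF8 + "AAA".encode("utf-8"),
    "utf-16": codecs.BOM_UTF16_LE + "AAA".encode("utf-16-le"),
    "utf-16-le": codecs.BOM_UTF16_LE + "AAA".encode("utf-16-le"),
    "utf-16-be": codecs.BOM_UTF16_BE + "AAA".encode("utf-16-be"),
    "utf-32": codecs.BOM_UTF32_LE + "AAA".encode("utf-32-le"),
    "utf-32-le": codecs.BOM_UTF32_LE + "AAA".encode("utf-32-le"),
    "utf-32-be": codecs.BOM_UTF32_BE + "AAA".encode("utf-32-be"),
}


def bom_in_decoded(encoding, data):
    """Return True if U+FEFF survives decoding as the first character."""
    return data.decode(encoding).startswith("\ufeff")


def run_bom_tests():
    """Decode every sample and record whether the BOM was kept."""
    return {enc: bom_in_decoded(enc, data) for enc, data in SAMPLES.items()}


if __name__ == "__main__":
    print(f"{'Encoding':<12} BOM included in decoded string?")
    for enc, kept in run_bom_tests().items():
        print(f"{enc:<12} {'yes' if kept else 'no'}")
```

Python's codecs are a different implementation from ICU, so the yes/no column will not match the tables above: its BOM-sensing "utf-16" and "utf-32" codecs consume the BOM, while the codecs with an explicit byte order and plain "utf-8" keep it. The point is only to show how such a check can be scripted.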
Before we conclude that the spec is OK as it stands, we should consider
whether it is correct to treat a BOM as a character. The Unicode standard
makes a clear distinction between characters and BOMs:
http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf section 2.13 (don't
skip the first couple of paragraphs)
http://unicode.org/faq/utf_bom.html#bom1
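
To make the distinction concrete, here is a rough sketch of what treating the BOM as a signature rather than as a character could look like once a string has been decoded. It is only an illustration of the idea, not a proposal for spec wording; the function name `split_signature` is hypothetical, and it assumes that only a U+FEFF in the very first position counts as a signature.

```python
BOM = "\ufeff"  # U+FEFF: a signature at the start of a stream


def split_signature(decoded):
    """Split a decoded string into (had_signature, text).

    Assumption for this sketch: only a U+FEFF in the first position is
    a signature. Any later U+FEFF is left in place as an ordinary
    character (ZERO WIDTH NO-BREAK SPACE).
    """
    if decoded.startswith(BOM):
        return True, decoded[1:]
    return False, decoded


if __name__ == "__main__":
    print(split_signature("\ufeffAAA"))
    print(split_signature("A\ufeffA"))
```

Under this reading the signature is reported separately from the text, so it never shows up as a character in the decoded data.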
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert at uk.ibm.com
Tel. 01962-816742
Internal tel. 246742