[DFDL-WG] Action 236: decoding UTF-16 sequence with an unpaired surrogate in ICU.

Steve Hanson smh at uk.ibm.com
Tue Nov 12 09:03:08 EST 2013


Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 12/11/2013 13:55 -----

From:   Steve Hanson/UK/IBM
To:     Alex Wood1/UK/IBM at IBMGB, 
Date:   12/11/2013 12:19
Subject:        Re: decoding UTF-16 sequence with an unpaired surrogate in 
ICU.


Thanks Alex. 

So we can control what ICU does in this scenario using 
dfdl:encodingErrorPolicy in the expected way, as the DFDL spec says.
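
For illustration only, one way a processor might translate the two dfdl:encodingErrorPolicy values ('error' and 'replace') into the java.nio actions that the decoder accepts is sketched below. The class and method names are hypothetical and are not taken from any DFDL implementation.

```java
import java.nio.charset.CodingErrorAction;

final class EncodingErrorPolicyMapping {

    /**
     * Hypothetical helper: maps a dfdl:encodingErrorPolicy value to the
     * CodingErrorAction a decoder would be configured with.
     */
    static CodingErrorAction toAction(String policy) {
        switch (policy) {
            case "error":
                // The decoder reports the problem; the caller turns it into an error.
                return CodingErrorAction.REPORT;
            case "replace":
                // The decoder substitutes its replacement character (U+FFFD by default).
                return CodingErrorAction.REPLACE;
            default:
                throw new IllegalArgumentException("Unknown encodingErrorPolicy: " + policy);
        }
    }

    public static void main(String[] args) {
        System.out.println("error   -> " + toAction("error"));
        System.out.println("replace -> " + toAction("replace"));
    }
}
```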




From:   Alex Wood1/UK/IBM
To:     Steve Hanson/UK/IBM at IBMGB, 
Date:   12/11/2013 12:12
Subject:        decoding UTF-16 sequence with an unpaired surrogate in 
ICU.


I wrote a Java program to test this in ICU4J.

When decoding, ICU classes an unpaired UTF-16 surrogate as malformed 
input.

The ICU API allows the programmer to specify the behaviour for malformed 
input: ignore, replace or report the offending code point. 

The default is to report it, so the decode would fail with an error.

The ICU4C API has similar options available.

test program:


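import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.spi.CharsetProvider;

import com.ibm.icu.charset.CharsetProviderICU;

public class test1 {

        public static void main(String[] args) {
                // Big-endian code units D834 DD1E (a valid surrogate pair for
                // U+1D11E) followed by an unpaired high surrogate D834.
                final byte[] byteArray = { (byte) 0xD8, 0x34, (byte) 0xDD, 0x1E,
                                (byte) 0xD8, 0x34 };

                CharsetProvider cp = new CharsetProviderICU();

                CharsetDecoder decoder = cp.charsetForName("UTF-16").newDecoder();
                // Change to REPLACE or REPORT to compare the three behaviours.
                decoder.onMalformedInput(CodingErrorAction.IGNORE);
                decoder.reset();
                ByteBuffer bb = ByteBuffer.wrap(byteArray, 0, 6);
                CharBuffer cb = CharBuffer.allocate(6);
                // endOfInput = true: no further bytes follow this buffer.
                CoderResult decodeResult = decoder.decode(bb, cb, true);

                if (decodeResult.isMalformed() || decodeResult.isUnmappable()) {
                        System.out.println("Error at " + bb.position());
                }
                // Flip the buffer so that toString() returns the decoded
                // characters rather than the unused remainder of the buffer.
                cb.flip();
                System.out.println("Result: " + cb.toString());
        }
}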
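
To compare the ignore, replace and report actions side by side, a variation along the following lines could be used. This is a sketch, not the ICU test above: it assumes the JDK's built-in UTF-16BE decoder so that it runs without the ICU4J charset jar, and the class and method names are made up for illustration. ICU's decoder may differ in detail.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class SurrogatePolicyDemo {

    /**
     * Decodes the bytes as UTF-16BE with the given action for malformed input.
     * Returns an empty Optional when the decoder reports an error instead of
     * producing text. Assumption: JDK decoder, not ICU4J.
     */
    static Optional<String> decode(byte[] bytes, CodingErrorAction action) {
        CharsetDecoder decoder = StandardCharsets.UTF_16BE.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            // The convenience method resets, decodes to end of input and flushes.
            return Optional.of(decoder.decode(ByteBuffer.wrap(bytes)).toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    /** Renders each UTF-16 code unit as four hex digits, to avoid console encoding issues. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Same data as the test program: a valid pair, then an unpaired high surrogate.
        byte[] data = { (byte) 0xD8, 0x34, (byte) 0xDD, 0x1E, (byte) 0xD8, 0x34 };
        CodingErrorAction[] actions = {
            CodingErrorAction.IGNORE, CodingErrorAction.REPLACE, CodingErrorAction.REPORT
        };
        for (CodingErrorAction action : actions) {
            String outcome = decode(data, action)
                    .map(SurrogatePolicyDemo::hex)
                    .orElse("malformed input reported");
            System.out.println(action + ": " + outcome);
        }
    }
}
```

With the JDK decoder, IGNORE should leave only the valid pair, REPLACE should append U+FFFD in place of the unpaired surrogate, and REPORT should fail the decode.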



Kind Regards,

- Alex

Alex Wood - 
Software Engineer - 
WebSphere Message Broker Development
DFDL Development

MP 211, IBM UK Labs, Hursley Park, Winchester, Hants. SO21 2JN.
Tel: Internal 246272, External 01962 816272
Notes: Alex Wood1/UK/IBM at IBMGB
e-mail: wooda at uk.ibm.com


Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU