[DFDL-WG] Glossary items needed (for a v12 errata?)

Mon Jan 28 11:23:42 EST 2013

Mike,

I agree with your analysis.

re: "I suspect we should just provide our definitions rather than 
switching terms, but I'm open to it if we want to convert all uses of 
codepoint to "code unit"."
There is already a lot of confusion (in the world of software) around 
Unicode and its terminology. Our goal should be to use terminology that is 
consistent with the Unicode standard; otherwise I foresee a lot of 
opportunity for confusion, leading to divergent implementations of DFDL. I 
would prefer us to switch to the standard terms unless it's really 
painful.

regards,

Tim Kimber, DFDL Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742 
Internal tel. 37246742

From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     dfdl-wg at ogf.org, 
Date:   28/01/2013 15:57
Subject:        [DFDL-WG] Glossary items needed (for a v12 errata?)
Sent by:        dfdl-wg-bounces at ogf.org

We need to add some entries associated with character set and encoding 
terminology that we use quite a bit.

I would note that our usage of the term 'codepoint' differs somewhat from 
the Unicode Glossary: http://unicode.org/glossary. First, we write 
codepoint as one word, not "code point" (there was some inconsistency on 
this that I have now fixed). Second, what we call a codepoint is closer to 
what the Unicode Glossary calls a 'code unit'. I suspect we should just provide our 
definitions rather than switching terms, but I'm open to it if we want to 
convert all uses of codepoint to "code unit".

Encoding - See Character Set Encoding

Codepoint - When a character set encoding uses variable-width 
representations for characters, the units making up these variable-width 
representations are called codepoints. For example, the UTF-8 encoding uses 
between 1 and 4 codepoints to represent characters, and for UTF-8, the 
codepoints are single bytes. The UTF-16 encoding is either fixed or 
variable width. When dfdl:utf16Width='variable' this encoding uses either 
one or two codepoints per character and each codepoint is a 16-bit value. 
When a character set is fixed width, then there is no distinction between 
a codepoint and a character code.
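
As an illustration of the Codepoint entry (not part of the proposed glossary text), here is a minimal Python sketch that counts the units each encoding uses per character. It assumes Python's built-in codecs as a stand-in for a DFDL implementation; the helper name code_units and the sample characters are hypothetical. Note that these units are what the Unicode Glossary would call code units.

```python
def code_units(ch: str, encoding: str, unit_bytes: int) -> int:
    """Count the fixed-size units an encoding uses to represent one character."""
    return len(ch.encode(encoding)) // unit_bytes

# Sample characters: U+0041, U+00E9, U+20AC, U+1F600 (the last is above 0xFFFF).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# UTF-8 units are single bytes; UTF-16 units are 16-bit (2-byte) values.
utf8_units = [code_units(c, "utf-8", 1) for c in samples]
utf16_units = [code_units(c, "utf-16-be", 2) for c in samples]

print(utf8_units)   # between 1 and 4 units per character
print(utf16_units)  # one or two units per character
```

The "utf-16-be" codec here corresponds to the variable-width reading of UTF-16, which is why the last sample needs two units.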

Code page - An alternate identifier for a Character Set Encoding.

Character Code - The numeric value assigned to a character in a character 
set that is independent of any specific encoding of that character set. 
For any fixed-size encoding (all characters have the same size 
representation)

Character Set - An abstract set of characters independent of any specific 
encoding scheme. Examples are the Unicode character set and the USASCII 
character set. 

Character Set Encoding - A specific representation of a character set as 
bytes or bits of data. A character set encoding is usually identified by a 
standard character set name or a recognized alias name, or by a code page 
identifier. These identifiers are standardized by the IANA. Examples are 
UTF-8, USASCII, GB2312, ebcdic-cp-it,  ISO-8859-5, UTF-16BE, Shift_JIS. 
The DFDL standard allows for implementation-specific character set 
encodings to be supported, and standardizes one name that is DFDL-specific 
which is USASCII-7bit-packed. 
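
To make the distinction between a character code and a character set encoding concrete, the following Python sketch encodes one character under three encodings. It is only an illustration: the chosen character and variable names are assumptions, and the encoding names are Python's codec spellings rather than DFDL property values.

```python
# One abstract character: CYRILLIC CAPITAL LETTER ZHE.
ch = "\u0416"

# The Unicode character code is independent of any encoding.
char_code = ord(ch)

# The same character has a different byte representation in each encoding.
encoded = {
    name: ch.encode(name).hex()
    for name in ("utf-8", "iso-8859-5", "utf-16-be")
}

print(hex(char_code))
for name, hex_bytes in encoded.items():
    print(name, hex_bytes)
```

The character code stays 0x416 throughout, while the bytes on the wire depend entirely on which encoding is selected.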

Character Width - The number of codepoints or bytes used to represent a 
character in a specific character set encoding is called the character 
width. Encodings are either fixed-width (all characters are encoded using 
the same width) or variable-width (different characters are encoded using 
different widths). For example, the UTF-32 character set encoding has a 
4-byte character width, whereas USASCII has a 1-byte character width.

Fixed-Width Character Encoding - A character set encoding where all 
characters are encoded using a single codepoint for their representation. 
Note that a codepoint may take up one or more bytes.

Surrogate Pair - A Unicode character whose character code value is greater 
than 0xFFFF can be encoded in UTF-16BE or UTF-16LE, which are 
variable-width encodings when the DFDL property dfdl:utf16Width='variable'. 
In this case the representation uses two adjacent codepoints, each of which 
is called a surrogate; together they are called a surrogate pair.
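
For reference, the surrogate pair for a character code above 0xFFFF can be computed with the standard UTF-16 arithmetic. The sketch below is illustrative only; the function name to_surrogate_pair is hypothetical, and the result is cross-checked against Python's own "utf-16-be" codec.

```python
from typing import Tuple

def to_surrogate_pair(code: int) -> Tuple[int, int]:
    """Split a character code above 0xFFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code <= 0x10FFFF:
        raise ValueError("only character codes above 0xFFFF need a surrogate pair")
    offset = code - 0x10000             # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Cross-check against the bytes Python produces for UTF-16BE.
expected = "\U0001F600".encode("utf-16-be")
print(expected == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```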

Variable-Width Character Encoding - A character set encoding where 
characters are encoded using one or more codepoints for their 
representation depending on which specific character is being encoded. An 
example is UTF-8 which uses from 1 to 4 bytes to encode a character. 

...mike

-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com
--
  dfdl-wg mailing list
  dfdl-wg at ogf.org
  https://www.ogf.org/mailman/listinfo/dfdl-wg

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU