[DFDL-WG] Glossary items needed (for a v12 errata?)

Tue Jan 29 14:59:02 EST 2013

Revised per discussion on call 2013-01-29.

We agreed to change our terminology to align with the Unicode standard
terminology. Alas, it isn't that simple, as Unicode's terminology is a
little incompatible with XML terminology ('encoding'), and with IANA
terminology "character set". The Unicode terminology also draws
distinctions that we don't really need.

I have also retained the term 'Character code' to mean the canonical
Unicode integer for a character, which is the same as the ISO10646 code
point for a character.

All appearances of "codepoint" will change to "code unit" consistent with
the Unicode glossary. We do not need the term "Code Point" after that, so
I've dropped it.

The term 'code page' appears only once in the standard, and can be changed
to 'character set encoding' there. So I suggest we stick with CCSID, and
drop the term 'code page'.

Here is the revised set of definitions in alphabetical order:

CCSID - see *Coded Character Set Identifier*

Character - An ISO10646 character having a unique *character code* as its
identifier. This concept is independent of font, typeface, size, and style,
so '*F*', '*F*', '*F*' are all the same character 'F'.

Character Code - The canonical integer used to identify a character in the
ISO10646 standard. This number identifies the character and is
independent of any specific character set encoding of the character.
Example: The '{' character, known in Unicode as LEFT CURLY BRACKET, has
character code U+007B. However, depending on the *character set encoding*,
the value 0x7B may or may not appear in the representation of that
character.
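
A minimal sketch of the example above, in Python. The codec names used
(utf-8, utf-16-be, cp037) are Python's own spellings and are chosen only for
illustration; cp037 stands in for "some EBCDIC encoding" and is an assumption,
not something named in this glossary.

```python
# Character code vs. encoded representation for '{' (LEFT CURLY BRACKET).
ch = "{"
char_code = ord(ch)  # the Unicode / ISO10646 character code, 0x7B

# Encode the same character with three encodings (illustrative choices).
representations = {
    name: ch.encode(name) for name in ("utf-8", "utf-16-be", "cp037")
}

print(f"character code: U+{char_code:04X}")
for name, data in representations.items():
    # 0x7B shows up in the UTF-8 and UTF-16BE bytes, but not in the EBCDIC one.
    print(f"{name:10} -> {data.hex()}")
```

The character code stays the same in every case; only the bytes produced by
each character set encoding differ.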

Character Set - An abstract set of characters that are assigned (or *mapped
to*) a representation by a particular *character set encoding*. For most
character set encodings, their character set is a subset of the Unicode
character set.
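
One way to see the "subset of Unicode" point is to ask whether a given
character can be encoded at all. The sketch below is only an illustration in
Python; the helper name in_character_set is hypothetical, and the codec names
are Python's spellings.

```python
def in_character_set(ch: str, encoding: str) -> bool:
    """Return True if `ch` has a representation in `encoding`.

    Hypothetical helper: a character belongs to an encoding's character set
    exactly when the encoder can map it to some representation.
    """
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# CYRILLIC CAPITAL LETTER YA is in the ISO-8859-5 character set, not in ASCII.
print(in_character_set("\u042f", "iso8859-5"))
print(in_character_set("\u042f", "ascii"))
# EURO SIGN is in neither, but is in the Unicode character set (via UTF-8).
print(in_character_set("\u20ac", "iso8859-5"))
print(in_character_set("\u20ac", "utf-8"))
```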

Character Set Encoding - Often abbreviated to just 'encoding'. A specific
representation of a character set as bytes or bits of data. A character set
encoding is usually identified by a standard character set encoding name or
a recognized alias name, or by a *coded character set identifier or CCSID*.
These identifiers are standardized. The names and aliases are standardized
by the IANA (where unfortunately, they are called character set names).
CCSIDs are an industry standard. Examples of character set encoding names
are UTF-8, USASCII, GB2312, ebcdic-cp-it, ISO-8859-5, UTF-16BE, Shift_JIS.
The DFDL standard allows for implementation-specific character set
encodings to be supported, and standardizes one name that is DFDL-specific
which is USASCII-7bit-packed.
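
To illustrate names versus aliases, here is a small sketch using Python's
codec registry. This registry is Python's own and is not the IANA registry,
so not every name listed above will necessarily resolve in it, and the
DFDL-specific name is not expected to. The helper same_encoding is
hypothetical.

```python
import codecs


def same_encoding(name_a: str, name_b: str) -> bool:
    """Return True if two names resolve to the same codec in Python.

    codecs.lookup() raises LookupError for names Python does not know.
    """
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


# A standard name and one of its recognized aliases identify one encoding.
print(same_encoding("US-ASCII", "ascii"))
print(same_encoding("ISO-8859-1", "latin1"))
# Two different encodings of the same (Unicode) character set.
print(same_encoding("UTF-8", "UTF-16BE"))
```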

Character Width - The number of code units, or alternatively the number of
bytes, used to represent a character in a specific character set encoding is
called the character width. Encodings are either fixed-width (all
characters encoded using the same width), or variable-width (different
characters are encoded using different widths). For example, the UTF-32
character set encoding has a 4-byte character width, whereas USASCII has a
1-byte character width. UTF-8 is variable-width, and any specific character
has width 1, 2, 3, or 4 bytes.
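
The widths quoted above can be checked with a short Python sketch. The sample
characters and the helper name char_width_bytes are illustrative choices, and
utf-32-be is Python's name for big-endian UTF-32 without a byte order mark.

```python
def char_width_bytes(ch: str, encoding: str) -> int:
    """Character width in bytes of one character in the given encoding."""
    return len(ch.encode(encoding))


# 'A', LATIN SMALL LETTER E WITH ACUTE, EURO SIGN, GRINNING FACE
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_widths = [char_width_bytes(c, "utf-8") for c in samples]
utf32_widths = [char_width_bytes(c, "utf-32-be") for c in samples]

print(utf8_widths)   # variable width: differs per character
print(utf32_widths)  # fixed width: the same for every character
```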

Code Unit - When a character set encoding uses differing *variable
width* representations for characters, the units making up these
variable-width representations are called *code units*. For example, the
UTF-8 encoding uses between 1 and 4 code units to represent characters, and
for UTF-8, the individual code units are single bytes. DFDL's interpretation
of the UTF-16 encoding is either fixed or variable width. When format
property dfdl:utf16Width='variable', UTF-16 is variable width and this
encoding uses either one or two code units per character, but in this case
each individual code unit is a 16-bit value. When a character set encoding
is fixed width, there is no distinction between a code unit and a
character code.
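
A rough Python sketch of splitting a character's representation into code
units. The table CODE_UNIT_BYTES and the helper code_units are hypothetical
names. Note that Python's UTF-16 codec always behaves as variable width, which
corresponds to dfdl:utf16Width='variable'.

```python
# Size of one code unit, in bytes, for a few encodings (illustrative table).
CODE_UNIT_BYTES = {"utf-8": 1, "utf-16-be": 2, "utf-32-be": 4}


def code_units(ch: str, encoding: str) -> list:
    """Split the encoded form of one character into its code units."""
    size = CODE_UNIT_BYTES[encoding]
    data = ch.encode(encoding)
    return [data[i:i + size] for i in range(0, len(data), size)]


euro = "\u20ac"            # character code at or below 0xFFFF
grinning = "\U0001F600"    # character code above 0xFFFF

print(len(code_units(euro, "utf-8")))          # several 1-byte code units
print(len(code_units(euro, "utf-16-be")))      # a single 16-bit code unit
print(len(code_units(grinning, "utf-16-be")))  # two 16-bit code units
```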

Coded Character Set Identifier (CCSID) - An alternate identifier of a
character set encoding. Originally created by IBM, CCSIDs are a broadly
used industry standard.

Encoding - See *Character Set Encoding*

Fixed-Width Character Encoding - A character set encoding where all
characters are encoded using a single code unit for their representation.
Note that a code unit is not necessarily a single byte.

Surrogate Pair - A Unicode character whose character code value is greater
than 0xFFFF can be encoded into variable-width UTF-16BE or UTF-16LE (which
are variable-width encodings when the DFDL property utf16Width='variable').
In this case the representation uses two adjacent *code units*, each of
which is called a surrogate, and the pair of which is called a surrogate
pair.
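
For reference, the two surrogates can be computed from the character code
with the standard UTF-16 arithmetic. The sketch below is in Python, and the
function name surrogate_pair is hypothetical.

```python
def surrogate_pair(char_code: int) -> tuple:
    """Return the (high, low) surrogate code units for a character code.

    Only character codes above 0xFFFF are represented as surrogate pairs.
    """
    if not 0x10000 <= char_code <= 0x10FFFF:
        raise ValueError("character code does not need a surrogate pair")
    offset = char_code - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16BE encoder.
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded.hex())
```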

Unicode - A character set defined by the Unicode Consortium, and
standardized at the International Organization for Standardization (ISO) as
ISO10646.

Variable-Width Character Encoding - A character set encoding where
characters are encoded using one or more code units for their
representation depending on which specific character is being encoded. An
example is UTF-8 which uses from 1 to 4 bytes to encode a character.

On Mon, Jan 28, 2013 at 10:57 AM, Mike Beckerle <mbeckerle.dfdl at gmail.com> wrote:

>
> We need to add some entries associated with character set and encoding
> terminology that we use quite a bit.
>
> I would note that our usage of the term 'codepoint' differs somewhat from
> the Unicode Glossary: http://unicode.org/glossary. First, we use
> codepoint as one word not "code point" (there was some inconsistency on
> this that I have now fixed), second, what we call codepoint is closer to
> what Unicode Glossary calls 'code unit'. I suspect we should just provide
> our definitions rather than switching terms, but I'm open to it if we want
> to convert all uses of codepoint to "code unit".
>
> Encoding - See *Character Set Encoding*
>
> Codepoint - When a *character set encoding* uses differing *variable width*
> representations for characters, the units making up these variable
> width representations are called codepoints. For example the UTF-8 encoding
> uses between 1 and 4 codepoints to represent characters, and for UTF-8, the
> codepoints are single bytes. The UTF-16 encoding is either fixed or
> variable width. When dfdl:utf16Width='variable' this encoding uses either
> one or two codepoints per character and each codepoint is a 16-bit value.
> When a character set is fixed width, then there is no distinction between a
> codepoint and a character code.
>
> Code page - An alternate identifier for a Character Set Encoding.
>
> Character Code - The numeric value assigned to a character in a character
> set that is independent of any specific encoding of that character set. For
> any fixed-size encoding (all characters have the same size representation)
>
> Character Set - An abstract set of characters independent of any specific
> encoding scheme: Examples are the Unicode character set, or the USASCII
> character set.
>
> Character Set Encoding - A specific representation of a character set as
> bytes or bits of data. A character set encoding is usually identified by a
> standard character set name or a recognized alias name, or by a *code page
> * identifier. These identifiers are standardized by the IANA. Examples
> are UTF-8, USASCII, GB2312, ebcdic-cp-it,  ISO-8859-5, UTF-16BE, Shift_JIS.
> The DFDL standard allows for implementation-specific character set
> encodings to be supported, and standardizes one name that is DFDL-specific
> which is USASCII-7bit-packed.
>
> Character Width - The number of codepoints or bytes used to represent a
> character in a specific character set encoding is called the character
> width. Encodings are either fixed width (all characters encoded using the
> same width), or variable-width (different characters are encoded using
> different widths). For example the UTF-32 character set encoding has 4-byte
> character width, whereas USASCII has a 1-byte character width.
>
> Fixed-Width Character Encoding - A character set encoding where all
> characters are encoded using a single codepoint for their representation.
> Note that a codepoint may take up one or more bytes.
>
> Surrogate Pair - A Unicode character whose character code value is greater
> than 0xFFFF can be encoded into variable-width UTF-16BE or UTF-16LE which
> are variable-width encodings when the DFDL property utf16Width='variable'.
> In this case the representation uses two adjacent *codepoints *each of
> which is called a surrogate, and the pair of which is called a surrogate
> pair.
>
> Variable-Width Character Encoding - A character set encoding where
> characters are encoded using one or more codepoints for their
> representation depending on which specific character is being encoded. An
> example is UTF-8 which uses from 1 to 4 bytes to encode a character.
>
>
>
> ...mike
>
> --
> Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
> www.tresys.com
>
>

-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com