At 10:27 PM 1/10/2014, nymble wrote:
one of several possible text encodings
Others might include:
- base 29
- base 59
- base 4096 (for UTF8 channels)
The primary reason for text encoding was that people wanted to transmit data through channels that might modify content or that limited the size and type of content: 7-bit ASCII, special interpretations of control characters (especially \r, \n, \0, \t), conversion to/from EBCDIC or other character sets, line length limitations, case-folding, multiple space compaction, parity bits, etc. A secondary goal is to support transcription by humans or optical character readers that are likely to make mistakes on some similar-looking characters, but that's much less common. A tertiary goal is that some programmers like to "improve" programs or make them "more efficient" by twiddling bits in ways that lead to software bugs, security holes, and the wrong kinds of chaos and anarchy, and yes I'm particularly including Phil Zimmermann and the standards committees who designed ASN.1 and DNS. To give those guys some slack, most of us started programming before the 8-bit byte was really universal and saving bytes here and there was *really* *important*.*

The most common encodings out there encode most of the characters in base-16 (or octal, for old DEC applications) or base-64 (uuencode and MIME), with various wrappers around them to handle line-length limitations and sometimes checksums. Sadly, base-85 didn't catch on - it used 5 characters to hold 4 bytes, vs. base-64's 4 characters per 3 bytes (so 6 characters, not counting padding, for 4 bytes), but it was late to the game and required doing multiplication and division instead of just bit-shifting.

I've never seen base-29 or base-59 encodings - is base-29 some attempt to fit into 5-level Baudot coding now that the deaf community have pretty much all moved off Model-28 TTY emulators to ASCII or mobile phone texting? Base-4096 in UTF-8 would be silly - it gets you 12 bits per variable-width character, each of which takes at least two bytes in UTF-8, so you could just as well use two base-64 characters, which carry the same 12 bits in two bytes, and not risk munging by systems that don't understand UTF-8.
(* My first programming environment had a printer with 132 48-character type bars and Model 026 keypunches doing Hollerith cards, which could print 56 different characters; I don't think we did any hacks using non-printer-supported punchcard fields and the card sorter, but it was possible.)
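For anyone who wants to check the arithmetic on the base-64 vs. base-85 comparison and on base-4096, here is a rough sketch using Python's standard base64 module. The variable names are just for illustration, and b85encode uses Python's own base-85 alphabet, which is not necessarily the variant any particular older tool used; the 5-characters-per-4-bytes ratio is the same either way.

```python
import base64

four = b"\x01\x02\x03\x04"  # an arbitrary 4-byte sample

# Characters needed to carry 4 bytes in each encoding.
b16_len = len(base64.b16encode(four))               # hex: 2 chars per byte
b64_len = len(base64.b64encode(four).rstrip(b"="))  # ignore '=' padding
b85_len = len(base64.b85encode(four))               # one 5-char group

print("chars for 4 bytes:", b16_len, b64_len, b85_len)

# Base-4096 carries 12 bits per character, the same as two base-64 characters.
bits_base4096 = (4096).bit_length() - 1      # log2(4096)
bits_two_b64 = 2 * ((64).bit_length() - 1)   # 2 * log2(64)

# UTF-8 encodes only U+0080..U+07FF in two bytes, so a 4096-symbol
# alphabet cannot fit entirely in two-byte characters.
two_byte_utf8 = 0x800 - 0x80

print("bits per symbol:", bits_base4096, "vs two base-64 chars:", bits_two_b64)
print("two-byte UTF-8 code points:", two_byte_utf8)
```

With standard padding the base-64 output for 4 bytes is actually 8 characters; the 6 above counts only the characters that carry data.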