12 Jan 2014
1:02 a.m.
> - base 4096 (for UTF8 channels)

I may reveal some crippling ignorance, but: UTF-8 is an encoding system that allows for effectively unlimited character extensions in binary text data. The original binary encodings for text merely assigned characters to a large segment of the 256 possibilities of a single byte (2^8). In early data transports, many of these codepoints were treated as instructions and could therefore inject transport-specific commands etc. (if I understand the problem correctly)*. Base64/base32 were intended to allow arbitrary binary data to be encoded into a transport that accepted text, without including codepoints likely to have control significance. Amusingly, some base32/64-style alphabets are restricted further to remove characters that might, if accidentally rearranged, cause people to see naughty words in binary.

When it comes down to it, a byte still only has 2^8 possibilities. UTF-8's extensions are indicated by lead bytes which say, in effect, "the following bytes should be viewed as a continuation of this character". I'm not sure how many ensuing bytes can be regarded as part of an extended encoding at a time, but I think it's only in the range of 1-3. If we assume it's 3, and further assume that, after declaring that the following 3 bytes are an extension, any arbitrary binary sequence will be interpreted as a visible, copy/pasteable character, then you're looking at a length penalty of 33% to encode arbitrary data: for every three bytes, you're escaping them to random characters by prefixing another byte. Yes, it's more nuanced than that; you can factor in the ASCII set and use that where possible, only escaping binary values outside the ASCII set, but one way or another you're adding length to the binary string by messing with it, the aim being a character-representable set of binary data that can be copy/pasted safely and passed through diverse transports.
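To put a rough number on the length question, here's a sketch (my own illustration, not something from the thread) comparing base64 against a naive "base 4096": pack 12 bits of data into each codepoint, drawn from an arbitrary 4096-wide range above U+0800, where every character costs 3 bytes in UTF-8. That means 24 UTF-8 bytes carry only 12 bits of payload, which is worse than base64's 33% overhead:

```python
# Sketch: byte cost of base64 vs. a naive "base 4096" scheme that
# maps each 12-bit group of the input to one Unicode codepoint.
# The codepoint base (U+1000) is an arbitrary choice for illustration;
# any 4096-wide range above U+0800 takes 3 bytes per char in UTF-8.
import base64
import os

data = os.urandom(300)  # arbitrary binary payload

# base64: 4 output bytes per 3 input bytes (~33% overhead)
b64 = base64.b64encode(data)

# "base 4096": split the input into 12-bit groups, one codepoint each
bits = "".join(f"{b:08b}" for b in data)
bits += "0" * (-len(bits) % 12)  # pad to a multiple of 12 bits
chars = [chr(0x1000 + int(bits[i:i + 12], 2)) for i in range(0, len(bits), 12)]
b4096 = "".join(chars).encode("utf-8")

print(len(data), len(b64), len(b4096))
# 300 input bytes -> 400 bytes as base64, but 600 bytes as UTF-8,
# because each 3-byte UTF-8 character only carries 12 bits of payload.
```

So at least for this naive construction, "UTF-8 channels" cost you more wire bytes than base64, even though the character count is shorter.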
So the question is what's most important: the ability to transport strings of data without a significant length penalty, the ability to transport strings of arbitrary data without affecting the transport, or the ability to copy/paste (a subset of "transport", I guess). Given these, my personal feeling is that if your concern is transport-related, which implies that you can't control the transport, then stick with base64. If your concern is length, then I don't feel UTF-8 will offer a significant advantage, and you're much better off using something like length-prefixing, the way bencoding does it. If your concern is copy/pasteability, then base58 works and is probably no worse than base-UTF-8, while being significantly easier to implement in code.

Spurious rant over.

* Take, for example, the way early email was sent: headers were specified and then the server awaited the body of the message, the end of which was indicated by what amounts to a string of characters: a newline, a period, and another newline. Easily injected by accident or design, along with other commands.

On 11/01/14 06:27, nymble wrote:
>
> consistent key formats are critical, need to converge on:
> - endianness
> - coordinate representation x, x&y, x and sign …
>   or bits to show which of these …. perhaps borrow ANSI method
> - hint / indication of cipher suite / curve
> - text encoding of binary format (ascii)
> - text encoding of binary format (utf8)
> - human readable format
>
>> ecc public key curve p25519(pcp 0.15)
> leaking crypto suite
> key should be usable in other contexts besides pcp 0.15
>
>> 1l0$WoM5C8z=yeZG7?$]f^Uu8.g>4rf#t^6mfW9(rr910
> one of several possible text encodings
> Others might include:
> - base 29
> - base 59
> - base 4096 (for UTF8 channels)
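P.S. On the "significantly easier to implement" point, a minimal base58 encoder really is only a few lines. A sketch, assuming the Bitcoin alphabet (the thread doesn't pin down a specific one):

```python
# Minimal base58 encoder: treat the input as one big integer and
# repeatedly divide by 58. Leading zero bytes are preserved as '1'
# characters, since they vanish in the integer representation.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58encode(data: bytes) -> str:
    n = int.from_bytes(data, "big")
    out = ""
    while n:
        n, r = divmod(n, 58)
        out = ALPHABET[r] + out
    pad = len(data) - len(data.lstrip(b"\x00"))  # count leading zero bytes
    return "1" * pad + out

print(b58encode(b"hello"))  # -> "Cn8eVZg"
```

The alphabet drops 0, O, I and l precisely for the copy/paste case: nothing visually ambiguous to mistranscribe.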