One incentive would be for the BBS operators to phase in a policy that they will accept no e-mail which is _not_ encrypted. Comments?

<< And how does your BBS software tell whether you've just sent encrypted mail, plaintext mail, or line noise? (in an encryption/decryption-at-the-user's-end scenario) -- Omega@spica.bu.edu
Re: distinguishing between encrypted mail, plaintext mail, and line noise.

I'm really glad this question came up. I passed over it before because I was more interested in the social issue, but the technical one is important.

The basic technique comes from the foundation of cryptography: information theory. For this application you can just measure the entropy; that alone should be able to distinguish between the three sources. The entropy measures how well one can statistically predict the output of a source. A random source has eight bits of entropy per byte; as randomness decreases, so does the entropy measure. (Mail me if you want references in order to learn this stuff yourself.)

Now line noise, let's say, will appear random, so its entropy should be right near the maximum, 8 bits. Text encrypted with PGP using the ASCII armor uses only 64 characters out of the 256 possible, or one fourth of the total available. Restricting to a quarter of the alphabet removes $\log_2 4 = 2$ bits, so its entropy tops out at about $\log_2 64 = 6$ bits per character. English text usually runs between four and five bits per character, if I remember right.

To calculate the entropy, first make a table (of size 256) of character frequencies normalized to the range [0,1]. Call these $p_i$. The entropy is then (TeX here) $ H = -\sum_{i=0}^{255} p_i \log_2 p_i $, where terms with $p_i = 0$ contribute nothing. (The log base 2 gives bits instead of natural units.) Now see which of the following ranges this number falls in:

  [3 .. 5]      regular text
  [5.5 .. 6.5]  ASCII-armored ciphertext
  [7 .. 8]      raw ciphertext or line noise

This is a very simple measure. There are other measures that look for the deviation from an expected distribution, which give much more accurate distinctions. One can very easily separate languages from each other just by looking at such measures. Note that none of these techniques ever look at the content, nor at digraph (two-letter combination) or trigraph statistics. In fact, the content is completely destroyed by the scanning process!

Lots of this stuff is known; this is how the big boys crack codes. I'm glad there arose a natural context to explain some of it.

Eric
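A minimal sketch of this single-character measure, in Python; the entropy and classify functions and the exact cutoffs are illustrative choices, not from the thread:

import math
from collections import Counter

def entropy(data):
    """Single-character entropy H_1 = -sum p_i log2 p_i, in bits per byte.
    Assumes a nonempty byte string."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def classify(data):
    """Bucket a byte stream into the rough ranges given above."""
    h = entropy(data)
    if h >= 7.0:
        return "raw ciphertext or line noise"
    if h >= 5.5:
        return "ASCII-armored ciphertext (or a similar encoding)"
    return "plaintext"

On random bytes entropy() comes out near 8 bits, on base64-style armor near 6, and on English text around 4 to 5.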
  [3 .. 5]      regular text
  [5.5 .. 6.5]  ASCII-armored ciphertext
  [7 .. 8]      raw ciphertext or line noise

This is a very simple measure. There are other measures that look for the deviation from an expected distribution, which give much more accurate distinctions. One can very easily separate languages from each other just by looking at such measures.

Where does uuencoded [compressed] binary lie? I would suspect it lies right around where encrypted text is. Presumably straight encrypted text is statistically random [7..8], and it's when you encode down to just the ascii subset that you lose the entropy.

dean
Dean:
Where does uuencoded [compressed] binary lie? I would suspect it lies right around where encrypted text is.
Right.
Presumably straight encrypted text is statistically random [7..8], and it's when you encode down to just the ascii subset that you lose the entropy.
Exactly. uuencoding will have a slightly lower single-character entropy than the ASCII armor PGP uses because just about every line begins with the letter 'M'. This will skew the distribution slightly. But a better way of distinguishing uuencoding and ASCII armor is to see that it falls in the same entropy class, and then just look at the alphabetic subsets used. Eric
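A rough sketch of that alphabetic-subset check, in the same vein as the entropy sketch above; the character sets follow the uuencode and base64 conventions, while the lowercase-letter heuristic is an illustrative choice:

import string

# uuencode draws from ASCII 32..96 (uppercase, digits, punctuation,
# backquote); base64/PGP armor uses A-Z a-z 0-9 + / =.
UUENCODE_SET = {chr(c) for c in range(0x20, 0x61)}
BASE64_SET = set(string.ascii_letters + string.digits + "+/=")

def armor_flavor(text):
    chars = set(text) - set("\r\n")
    if chars <= BASE64_SET and not chars <= UUENCODE_SET:
        return "base64 / PGP ASCII armor"  # lowercase letters present
    if chars <= UUENCODE_SET:
        return "uuencode"  # no lowercase: the old 32..96 character set
    return "neither"

One could also test for the leading 'M' mentioned above: a full 45-byte uuencoded line always starts with it, since the length byte 45 encodes to chr(45 + 32).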
Re: entropy

I seem to remember that English text is about 1.5 bits per character. I can find a reference if you're interested.

e
uuencoding will have a slightly lower single-character entropy than the ASCII armor PGP uses because just about every line begins with the letter 'M'. This will skew the distribution slightly. But a better way of distinguishing uuencoding and ASCII armor is to see that it falls in the same entropy class, and then just look at the alphabetic subsets used.
It's not that simple. The entropy of a byte stream is, roughly, the average number of bits needed to represent each byte. If what is uuencoded is extremely repetitive, the entropy will be low, maybe even less than one bit. On the other hand, if it were random data, it would come out just slightly lower than ASCII armor. Binaries are somewhat repetitive, so they have somewhat less entropy than random data. English has a lot of redundancy, so it has a low entropy.

e
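A small demonstration of this point, using base64 in place of uuencode (any per-line ASCII encoding behaves the same way); the inputs and names are illustrative:

import base64, math, os
from collections import Counter

def entropy(data):  # same measure as sketched earlier in the thread
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

repetitive = base64.b64encode(b"\x00" * 3000)    # extremely repetitive input
random_ish = base64.b64encode(os.urandom(3000))  # random input

print(entropy(repetitive))   # ~0 bits/byte: a single symbol dominates
print(entropy(random_ish))   # near the log2 64 = 6 bits/byte ceiling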
Re: entropy

Eric Hollander writes:
I seem to remember that English text is about 1.5 bits per character. I can find a reference if you're interested.
There are lots of entropies available to measure. There is the "true" entropy, the lower bound for all other entropy measures; this is the compressibility limit. The entropy I was referring to was simply the single-character entropy. That is, the probabilities $p_i$ in the entropy expression are the probabilities that a given single character appears in the text. This will be higher than the true entropy.

Shannon's estimate for $H_1$ was 4.03 bits/character. This assumes a 27-character alphabet. The entropy for ASCII-represented English will be higher because of punctuation and capitals. The true entropy of English is much lower than this, of course. But for a simple measure to automatically distinguish between plaintext and ciphertext, the single-character entropy should suffice.

Re: uuencoding. In my analysis before, I assumed that the uuencoding would be of random data. If it is not from random data, then the entropy will be lower. Thanks for the clarification.

Eric
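(A sanity check on that figure: for a 27-character alphabet the single-character entropy is bounded by the uniform case, (TeX here) $H_1 \le \log_2 27 \approx 4.75$ bits/character. Shannon's 4.03 is lower because English letter frequencies are far from uniform, with 'e' alone accounting for roughly one character in eight.)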
participants (4)
- Eric Hollander
- Eric Hughes
- omega@spica.bu.edu
- tribble@xanadu.com