Summary: Steganography is essential for private communication, since well-encrypted messages stand out too easily and no "solidarity" of sophisticated cryptography users is likely to make such messages less obvious any time soon. By "steganography" I mean inserting a hidden message into ordinary text in such a way that even if the algorithm for inserting the hidden message is public, only the intended receiver can read the hidden message or even show that a hidden message exists. I list several types of measures of "normal" English text that may be useful for steganalysis, and then I present calculations suggesting that English has a steganographic capacity of about 10 percent.

Note: This is my "newbie" post to cypherpunks. It asks many questions because there is a lot that I do not know, but I hope it also has several thought-provoking ideas. I am mostly trying to elicit feedback from those who are more knowledgeable about cryptology-related matters by providing them with some problems that are both useful and mentally stimulating.

Failed PGP Social Program

In his introduction to PGP, Phil Zimmermann compares plaintext messages to mail sent on postcards and encrypted messages to mail in sealed envelopes. Currently, using envelopes does not arouse suspicion because almost everyone uses envelopes, but using encryption does arouse suspicion because almost nobody uses encryption. Zimmermann's proposed solution is for almost everyone to use encryption routinely, so that encrypted messages become the norm.

I do not believe that this will succeed, at least not in the way Zimmermann hopes. Even though PGP is highly regarded, free, and fairly readily accessible, no "solidarity" of PGP users will arise unless email with PGP encryption becomes transparently convenient to use and also does not invite civil lawsuits or criminal charges. (An RSAREF version of PGP would help, though.) The kinds of encryption that *will* become readily available, easy to use, and legally hassle-free are the crippled kinds of encryption. Encryption that is not crippled will always be suspect, perhaps illegal.

By using sufficiently intelligent steganographic techniques, however, we will not need any "solidarity" from other people at all. If our "envelopes" look like "postcards," they will not arouse the stormtroopers.

Steganography and Steganalysis

A few people have experimented with inserting messages into image files, but most of our email traffic is text, so I am most interested in steganographic techniques for normal English prose. Furthermore, we need a reasonably high efficiency for inserting the hidden message while not contorting the text too far from normal.

Peter Wayner's Mimic functions for producing baseball game commentary are notable. (No, I still haven't done the C conversion of the Think Pascal version I received almost two years ago. But I haven't forgotten!) I am not certain how efficiently his program encodes the hidden message, but I do want the resulting text to be less conspicuous. Imagine thousands of messages per day consisting of similar-sounding commentary on the Whappers and the Blogs! That's too obvious. Gus Simmons [CRYPTO83] has described subliminal channels, which certainly are suitably innocuous, but unfortunately far too low bandwidth.

A good steganographic system should insert encrypted messages into English text so unobtrusively that nobody but the intended receiver can show that a hidden message exists, even if the algorithm for the steganographic system is made public.
(Perhaps I should call this "stealthography"?) The examples of steganography described in [KAHN] all fail this test. Similarly, so do silly kinds of "steganography" such as the following, in which the first letters of the sentences spell out "SECRET":

    So how have you been doing?  Everything is fine here.
    Can we visit soon?  Remember when we went white-water
    rafting?  Everyone got soaked!  That would be fun to do
    again!

This is silly not only because the hidden message is not encrypted but also because anyone who knows the insertion algorithm can readily discover that a hidden message does indeed exist.

To create a good cryptographic system, one must first do cryptanalysis. Similarly, I suggest that to create a good steganographic system, one should first do steganalysis. For that reason, the next section of this message focuses on potential tools for steganalysis. Perhaps people more knowledgeable about steganalysis will tell us how best to make use of these, and other, tools.

Disclaimer: I admit that my knowledge of steganalysis is limited. Perhaps at this point I should just ask what I should read to learn more about this, but I suspect that the public literature is sparse and scattered. For example, we have the words "encryption" and "decryption", but what do we call the corresponding words for steganography: steganization and desteganization? If we don't even have good terminology for the process, I suspect that we do not have much well-organized literature on it, either. What follows is my best guess concerning steganographic issues.

The first goal of steganalysis is to determine that a hidden message is likely. The second goal is extracting that hidden message, and the third goal is decrypting it. To be able to infer that a hidden message is likely, we need measures that distinguish normal from unusual English text.

Measures of Normal English Text

What is normal English text? In general, this is unsolvable, and not even well-defined; it depends on the context, author, subject, etc. Nevertheless, I can think of several kinds of measures that are likely to be useful, and I hope that other people can suggest more.

(1) letter frequency

Letter frequency is just the first-order Markov model for English. Shannon showed how 2nd-order, 3rd-order, etc. Markov models enable increasingly English-like output from a random source. How much deviation from these standard frequencies is normal? What other kinds of letter-frequency-related statistics might be useful? For example, if you measure the number of characters between each occurrence of a particular character, what type of distribution of intervals should you get? (An exponential distribution? A Poisson distribution? An Erlang distribution?)

(2) word frequency

Shannon also constructed 1st-order, 2nd-order, etc. Markov approximations to English using words rather than characters as the elements. How much variation should we expect from these approximations in ordinary English? Zipf's Law [WELSH, p. 97] states that the word frequency for a language obeys the formula:

    p(n) = A / n

where A is a constant chosen so that:

    SUM over n of p(n) = 1

For example, in English, the most frequently used words are, in order, "the", "of", "and", and "to". According to Zipf's Law, the word "the" should be used about twice as often as the word "of" and about four times as often as the word "to". Mandelbrot suggested a more complex formula:

    p(n) = A / (n + V)^(1/D)

where V and D are independent parameters. I suppose that the intelligence agencies have even more sophisticated models. (See the word-frequency sketch after this list.)
(3) compressibility

According to [WELSH, p. 96], Shannon's experiments measured the entropy of English (over a 26-letter alphabet plus a space) as only 0.6 to 1.3 bits per character. Since normal English text has both upper and lower case, digits, and other characters, perhaps a better value for normal English is about 2.5 bits per character. (If so, then shouldn't compression programs be able to achieve about a factor of 8 / 2.5 > 3 compression?) Is "dense" writing less compressible than "fluff"? Apparently so, since measurements of the redundancy of various English texts [WELSH, p. 100] show significant differences. Since well-encrypted messages are incompressible, will a message that hides an encrypted message be less compressible than normal English text? (See the compressibility sketch after this list.)

(4) grammar, style, and readability

Grammar checkers can distinguish normal sentences from text such as

    "Distinguish normal can grammar checkers text sentences from."

that may nevertheless satisfy other statistics for normal English text. But what is an ordinary distribution of legal grammars of English sentences? Also, how does one allow for the different conventions in formal, written English vs. conversational English vs. slang vs. email/USENET netspeak vs. special sublanguages such as computer languages or mathematics? Bear in mind that netspeak has several distinguishing features. For example, email addresses of the form xxx@xxx.xxx.xxx, quoted text with a ">" in column 1, and smilies are typical net conventions. Mail headers and signatures (especially PGP signatures) have a special structure, too. Can a grammar checker help to distinguish normal text from text that may have a hidden message? What useful clues may style and readability scores (Kincaid, Coleman-Liau, Flesch, etc.) give? An interesting experiment would be to compare automated readability scores with the compressibility of the text.

(5) semantic continuity and logic

Do the sentences in a paragraph relate somehow to each other, or are they separate, independent constructions? How can that be measured automatically?

(6) message context

Does the content of the message look normal in its context? (For example, a baseball play-by-play would look out of place in sci.med.) How can that be measured automatically?

(7) obvious

Some people are known suspects, no matter how innocuous-looking their messages are. All their messages are suspect.

(8) other measures

What other measures might be useful for detecting the likely presence of a hidden message? The distribution of the number of words in a sentence? The distribution of the number of sentences (or words) in a paragraph? What programs and/or databases are readily available for making these measures?
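A quick, unpolished sketch of the word-frequency check from measure (2) (Python 3 here, but any language would do): count the words in a file and see how nearly n * p(n) stays constant over the top-ranked words, as Zipf's Law predicts. The crude word-splitting rule is just an assumption for illustration.

    # zipf_check.py -- rough sketch, not a polished tool.
    # Counts word frequencies and compares the observed rank-frequency
    # curve against Zipf's Law p(n) = A / n.
    import re
    import sys
    from collections import Counter

    def word_counts(path):
        """Return a Counter of lowercased words in the file, plus the total."""
        with open(path, errors="replace") as f:
            words = re.findall(r"[a-z']+", f.read().lower())
        return Counter(words), len(words)

    def zipf_report(counts, total, top=10):
        """Print rank n, word, observed p(n), and n * p(n), which Zipf's
        Law predicts should be roughly constant (= A)."""
        for rank, (word, count) in enumerate(counts.most_common(top), start=1):
            p = count / total
            print(f"{rank:4d}  {word:12s}  p = {p:.4f}   n*p = {rank * p:.4f}")

    if __name__ == "__main__":
        counts, total = word_counts(sys.argv[1])
        zipf_report(counts, total)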
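A rough sketch of the compressibility check from measure (3): use an off-the-shelf compressor (zlib here, purely as a stand-in for a real entropy estimate) and compare the bits-per-character figure across texts. Whether hiding an encrypted payload measurably changes this number is exactly the open question above.

    # compress_ratio.py -- rough sketch of measure (3).
    # Normal English should compress to roughly a third of its size;
    # text carrying an incompressible hidden payload may compress less.
    import sys
    import zlib

    def bits_per_char(text):
        """Estimate bits per character from the zlib-compressed size."""
        raw = text.encode("latin-1", errors="replace")
        compressed = zlib.compress(raw, 9)
        return 8.0 * len(compressed) / len(raw)

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            with open(path, errors="replace") as f:
                bpc = bits_per_char(f.read())
            print(f"{path}: {bpc:.2f} bits/char (ratio {8.0 / bpc:.2f}:1)")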
Steganographic Capacity of English Text

If the public English text is N characters long, how long can a perfectly hidden message within that public text be? I think that it can be about N/10 characters long, for a steganographic capacity of 10%. I will show two ways to hide information in the public text: (1) the grammatical structure of the sentence and (2) the word choice in the sentence. (These are not the only methods, but they may be the two best methods.)

Do you recall back in school when you "diagrammed" sentences in your English class? That was actually imposing a parenthesization on the sentence. For example, the sentence:

    The tall boy ate the big pie.

becomes:

    (The (tall boy)) (ate (the (big pie)))

The number of possible parenthesizations of a sentence of N words is related to the number of ways to match N pairs of parentheses. The number of matchings is the Nth Catalan number:

    X(N) = C(2N, N) / (N + 1)  >=  2^(N-2)        [AHU, p. 73]

where C(2N, N) is the number of combinations of 2N objects taken N at a time, which is (2N)! / (N!)^2. The number of parenthesizations is the (N-1)st Catalan number. If all parenthesizations were equally likely, then the parenthesization of a sentence of N words would give greater than (N-1)-2 = N-3 bits of information, for 1 - 3/N bits per word. (Of course, not all parenthesizations are equally likely. But X(N) is also much larger than 2^(N-2), so for now I'll assume that those two roughly cancel out.) Since the average word length in English is about 4 characters [WELSH, p. 101], or 5 characters counting a separating space, and each ASCII character has 8 bits, we get a steganographic efficiency of (1 - 3/N) / 40. (Notice that I am ignoring punctuation in my count of characters in English text. Since this count is just a rough approximation anyway, the effect of punctuation should get lost in the noise.)

Another way to hide information in the public text is with the choice of words. Since English has a large vocabulary, I think that almost always we can get one bit of information per word, just from the word choice alone. (Unusual words should not be used often, though, since normal English text does not use them often.) For example, we might XOR all the bits of all the characters of the word and use its parity. (A small sketch of this appears at the end of this section.)

Can we get two bits per word? Probably most of the time. Suppose that we try to get two bits per word from our word choice but succeed only with probability p. The channel capacity of a BSC with error rate 1-p is:

    1 + p log p + (1-p) log (1-p)  =  1 - H(p)

By Shannon's channel coding theorem, we should be able to find an error-correcting code that approaches this capacity. (Use of that encoding unfortunately may alter the statistics of the hidden message sufficiently to expose the use of steganography, however.) For what values of p will it be worthwhile to insert an uncertain two bits per word rather than a (nearly) certain one bit? Two attempted bits per word yield about 2 (1 - H(p)) reliable bits per word, which beats one certain bit only when H(p) < 0.5. Since H(0.11) = 0.5 (approximately), p had better be .89 or higher. If p is .95, then H(p) = .29 (approximately), giving 1.4 bits per word rather than just 1 bit per word.

I doubt that we can get better than 1.4 bits per word with this method and still have normal-looking English, though, because of Zipf's Law. The normal frequencies of the four words "the", "of", "and", and "to" are high, totalling at least 10%, so the public text has to include many of them, whether we want their particular parity bit patterns or not. We can improve the efficiency by attempting two bits of information only for the long words and attempting only one bit for the short words. Maybe we should attempt to achieve |K/5| bits for words of K characters, where "|x|" means "x rounded down to the next integer". Or maybe we should not try to hide any bits at all in the extremely short words. I don't have enough information about typical English to analyze that.

What is the total steganographic efficiency we achieve by exploiting both the grammatical structure and the word choice? My estimates total:

    ( (1 - 3/N) + 1.4 ) / 40  =  0.06 - 3/(40N)

Just to get a number, let's assume that N = 10 words per sentence. That gives us 0.0525, which I'll round down to 0.05. That actually gives us much better than 5%, though, because the hidden message is first compressed and then encrypted. If compression halves the length of the hidden message, we get effectively a 10% efficiency for the steganographic capacity of English.
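To make these two mechanisms concrete, here is a minimal sketch of the receiver's side of the word-choice parity idea, together with the small amounts of arithmetic used above (the Catalan-number bound and the 2 * (1 - H(p)) figure). The parity rule and the whitespace word-splitting are placeholders for whatever convention sender and receiver actually agree on; this is only a sketch, not a working steganographic system.

    # capacity_sketch.py -- rough sketch of the capacity arithmetic above.
    from math import comb, log2

    def catalan(n):
        """Nth Catalan number: C(2n, n) / (n + 1)."""
        return comb(2 * n, n) // (n + 1)

    def word_parity(word):
        """XOR of all bits of all characters in the word (the '1 bit per
        word from word choice' rule suggested above)."""
        x = 0
        for ch in word:
            x ^= ord(ch)
        return bin(x).count("1") & 1

    def extract_bits(text):
        """Receiver side: read one parity bit per word, in order."""
        return [word_parity(w) for w in text.split()]

    def bsc_capacity(p):
        """1 - H(p): capacity of a binary symmetric channel when each
        attempted bit survives with probability p."""
        q = 1.0 - p
        if q <= 0.0 or q >= 1.0:
            return 1.0
        return 1.0 + p * log2(p) + q * log2(q)

    if __name__ == "__main__":
        # Word-choice bits, as the receiver would recover them:
        print(extract_bits("The tall boy ate the big pie"))
        # Parenthesization bits for an N = 10 word sentence: the (N-1)st
        # Catalan number vs. the N-3 bit lower bound used in the estimate.
        n = 10
        print(f"{log2(catalan(n - 1)):.1f} bits available vs. lower bound {n - 3}")
        # Two attempted word-choice bits beat one certain bit once
        # 2 * (1 - H(p)) > 1, i.e. once p is about .89 or higher:
        for p in (0.89, 0.95):
            print(f"p = {p}: about {2 * bsc_capacity(p):.2f} bits/word")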
This estimate will decrease by whatever amount typical English parenthesization departs from uniform over all possibilities, but it will increase with improved exploitation of word choice and, especially, improved compression. Of course, the effectiveness of this camouflage depends on the sophistication of one's model of English text. Perhaps normal English has enough variation that a good, but not perfect, model of English will yield public text that is indistinguishable from normal text, even to the more resourceful eavesdroppers.

Kevin Q. Brown
INTERNET   kqb@whscad1.att.com  or  kevin_q_brown@att.com

AHU   - The Design and Analysis of Computer Algorithms, Aho, Hopcroft, and Ullman, Addison-Wesley, 1974.
KAHN  - The Codebreakers, David Kahn, Macmillan, 1967.
WELSH - Codes and Cryptography, Dominic Welsh, Clarendon Press, 1988.
What about reverse 'stealthography', where instead of first creating your message and then attempting to create some larger body of text in which to hide it, one would first generate the message to be hidden, then take an existing body of text (something large enough, like Shakespeare's Macbeth) and attempt to find some concise algorithm by which the receiving end would extract the message?

-Devin
---- ghoast@gnu.ai.mit.edu ----
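One crude way to make that concrete (purely an illustration; the first-letter indexing scheme below is an assumption, and it amounts to little more than a book cipher): the cover text stays fixed and public, and the "concise algorithm" the receiver applies is a fixed extraction rule plus a short list of word indices, which would itself still need to be conveyed or encrypted.

    # book_key.py -- crude illustration of the "reverse stealthography" idea.
    # The cover text (e.g. Macbeth) is fixed and public; what gets sent is a
    # short list of word indices from which the receiver reads off the message.
    import re

    def make_key(cover_text, message):
        """For each letter of the message, find the index of a cover word
        starting with that letter.  Raises if some letter never starts a
        word in the cover text."""
        words = re.findall(r"[A-Za-z']+", cover_text)
        key = []
        start = 0
        for ch in message.lower():
            if not ch.isalpha():
                continue
            for i in range(len(words)):
                j = (start + i) % len(words)      # wrap around the cover text
                if words[j].lower().startswith(ch):
                    key.append(j)
                    start = j + 1                 # keep indices moving forward
                    break
            else:
                raise ValueError(f"no cover word starts with {ch!r}")
        return key

    def extract(cover_text, key):
        """Receiver side: read the first letter of each indexed word."""
        words = re.findall(r"[A-Za-z']+", cover_text)
        return "".join(words[i][0].lower() for i in key)

    if __name__ == "__main__":
        cover = "When shall we three meet again, in thunder, lightning, or in rain?"
        key = make_key(cover, "wit")
        print(key, extract(cover, key))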
Someone is probably doing steganography in netnews and/or mailing lists right now! (Besides cypherpunks, I mean.) How would we find them?

Someone with a news feed and some CPU time and hacking time on their hands could come up with analysis tools that scan news or email articles, looking for unusual patterns. You can debug them on something with a small flow, then gradually speed and smarten them up to be able to run across the whole netnews flow (at multiple sites).

If nothing else, such a package would provide a way to winnow signal from noise on Usenet, by tweaking the parameters until they kicked out a reasonable number of messages per day. E.g. "give me the ten messages from rec.books that use the most varied vocabulary", or "locate C source code with lots of comments for my friend who's learning C".

And, if some of us work on ways to hide information in the flow, and others work on ways to locate and extract it, the two efforts will complement each other. Think of it as "quality assurance" or "testing" for the information-hiding effort. We certainly won't be the only people looking! So let's see what the NSA, KGB, etc. are finding...

Bill Tuthill's "hum" (humanities department support) package from comp.sources may give you some ideas. It's not 100% useful for this, but it's there:

    A new package of programs for literary and linguistic computing is
    available, emphasizing the preparation of concordances and supporting
    documents.  Both keyword in context and keyword and line generators
    are provided, as well as exclusion routines, a reverse concordance
    module, formatting programs, a dictionary maker, and lemmatization
    facilities.  There are also word, character, and digraph frequency
    counting programs, word length tabulation routines, a cross reference
    generator, and other related utilities.  The programs are written in
    the C programming language, and implemented on several Version 7 Unix
    systems at Berkeley.

    hum/Part01: v10i27: Bill Tuthill's "hum" text concordance package, Part01/03
    hum/Part02: v10i28: Bill Tuthill's "hum" text concordance package, Part02/03
    hum/Part03: v10i29: Bill Tuthill's "hum" text concordance package, Part03/03
    hum.pch:    v11i065: Hum concordance package update kit

in ftp.uu.net:/usenet/comp.sources.unix/volume10 and volume11.

John Gilmore
gnu@toad.com -- gnu@cygnus.com -- gnu@eff.org
Creating freedom, rather than longer chains, bigger cages, better meals, . . .
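A rough first cut at the "ten messages with the most varied vocabulary" filter described above: rank message files by type/token ratio (distinct words divided by total words), a crude proxy for vocabulary diversity. A real netnews scanner would have to strip headers, quoted text, and signatures first; this is only a sketch.

    # vocab_rank.py -- minimal sketch of the "most varied vocabulary" filter.
    import re
    import sys

    def type_token_ratio(text):
        """Distinct words / total words, lowercased."""
        words = re.findall(r"[a-z']+", text.lower())
        return len(set(words)) / len(words) if words else 0.0

    if __name__ == "__main__":
        scored = []
        for path in sys.argv[1:]:
            with open(path, errors="replace") as f:
                scored.append((type_token_ratio(f.read()), path))
        # print the ten most "varied" messages
        for ratio, path in sorted(scored, reverse=True)[:10]:
            print(f"{ratio:.3f}  {path}")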
> Someone is probably doing steganography in netnews and/or mailing lists
> right now!  (Besides cypherpunks, I mean.)  How would we find them?

Food for thought: at least as of recently, the NSA bought weekly dumps of all usenet articles on tape. I highly doubt they were for their reading pleasure...

andrew
andrew m. boardman says:
> > Someone is probably doing steganography in netnews and/or mailing lists
> > right now!  (Besides cypherpunks, I mean.)  How would we find them?
> Food for thought: at least as of recently, the NSA bought weekly dumps of
> all usenet articles on tape.  I highly doubt they were for their reading
> pleasure...
Many organizations buy complete dumps of usenet -- it's a way of getting a newsfeed if your organization is too paranoid to let you get a network connection. I don't know if the NSA was such an organization, but it would not surprise me.

Perry
participants (5)
- andrew m. boardman
- ghoast@gnu.ai.mit.edu
- gnu
- kqb@whscad1.att.com
- Perry E. Metzger