Summary: Steganography is essential for private communication since
well-encrypted messages stand out too easily and no "solidarity"
of sophisticated cryptography users is likely to make such
messages less obvious any time soon. By "steganography"
I mean inserting a hidden message into ordinary text in such
a way that even if the algorithm for inserting the hidden
message is public, only the intended receiver can read the
hidden message or even show that a hidden message exists.
I list several types of measures of "normal" English text
that may be useful for steganalysis and then I present
calculations suggesting that English has a steganographic
capacity of about 10 percent.
Note: This is my "newbie" post to cypherpunks. It asks many
questions because there is a lot that I do not know, but
I hope it also has several thought-provoking ideas.
I am mostly trying to elicit feedback from those who are
more knowledgeable about cryptology-related matters by
providing them with some problems that are both useful and
mentally stimulating.
Failed PGP Social Program
In his introduction to PGP, Phil Zimmermann compares plaintext messages
to mail sent on postcards and encrypted messages to mail in sealed
envelopes. Currently, using envelopes does not arouse suspicion
because almost everyone uses envelopes, but using encryption does
arouse suspicion because almost nobody uses encryption. Zimmermann's
proposed solution is for almost everyone to use encryption routinely,
so that encrypted messages will be the norm.
I do not believe that this will succeed, at least not in the way
Zimmermann hopes. Even though PGP is highly regarded, free, and
fairly readily accessible, no "solidarity" of PGP users will arise
unless email with PGP encryption becomes transparently convenient to
use and also does not invite civil lawsuits or criminal charges. (An
RSAREF version of PGP would help, though.) The kinds of encryption
that *will* become readily available, easy-to-use, and legally
hassle-free will be the crippled kinds of encryption. Encryption
that is not crippled always will be suspect, perhaps illegal.
By using sufficiently intelligent steganographic techniques, however,
we will not need any "solidarity" from other people at all. If our
"envelopes" look like "postcards," they will not arouse the
stormtroopers.
Steganography and Steganalysis
A few people have experimented with inserting messages into image
files. But most of our email traffic is text, so I am most interested
in steganographic techniques for normal English prose. Furthermore, we
need to have a reasonably high efficiency for inserting the hidden
message while not contorting the text too far from normal. Peter
Wayner's Mimic functions for producing a baseball game commentary are
notable. (No, I still haven't done the C conversion of the Think
Pascal version I received almost two years ago. But I haven't
forgotten!) I am not certain how efficiently his program encodes the
hidden message, but I do want the resulting text to be less
conspicuous. Imagine thousands of messages per day consisting of
similar sounding commentary on the Whappers and the Blogs! That's too
obvious. Gus Simmons [CRYPTO83] has described subliminal messages,
which certainly are suitably innocuous, but unfortunately far too low
bandwidth.
A good steganographic system should insert encrypted messages into
English text so unobtrusively that nobody but the intended receiver
can show that a hidden message exists, even if the algorithm for the
steganographic system is made public. (Perhaps I should call this
"stealthography"?) The examples of steganography described in [KAHN]
all fail this test. Similarly, so do silly kinds of "steganography"
such as the following "SECRET":
So how have you been doing?
Everything is fine here.
Can we visit soon?
Remember when we went white-water rafting?
Everyone got soaked!
That would be fun to do again!
This is silly not only because the hidden message is not encrypted but
also because anyone who knows the insertion algorithm can readily
discover that a hidden message does indeed exist.
To create a good cryptographic system, one must first do cryptanalysis.
Similarly, I suggest that to create a good steganographic system, one
should first do steganalysis. For that reason, the next section of
this message focuses on potential tools for steganalysis. Perhaps
people more knowledgeable about steganalysis will tell how best to make
use of these, and other, tools for steganalysis.
Disclaimer: I admit that my knowledge of steganalysis is limited.
Perhaps at this point I should just ask what I should read to learn
more about this, but I suspect that the public literature is
sparse and scattered. For example, we have the words "encryption" and
"decryption", but what do we call the corresponding words for
steganography: steganization and desteganization? If we don't even
have good terminology for the process, I suspect that we do not have
much well-organized literature on it, either. What follows is my
best guess concerning steganographic issues.
The first goal of steganalysis is to determine that a hidden message is
likely. The second goal is extracting that hidden message and the
third goal is decrypting that hidden message. To be able to infer that
a hidden message is likely, we need measures that distinguish normal
from unusual English text.
Measures of Normal English Text
What is normal English text? In general, this is unsolvable, and not
even well-defined. It depends on the context, author, subject, etc.
Nevertheless, I can think of several kinds of measures that are likely
to be useful and I hope that other people can suggest more.
(1) letter frequency
Letter frequency is just the first-order approximation to English.
Shannon showed how 2nd-order, 3rd-order, etc. Markov approximations
yield increasingly English-like output.
How much deviation from these standard frequencies is normal?
What other kinds of letter frequency-related statistics might
be useful? For example, if you measure the number of characters
between each occurrence of a particular character, what type of
distribution of intervals should you get? (An exponential
distribution? A Poisson distribution? An Erlang distribution?)
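As a very rough sketch of the kind of measurement I have in mind
(Python; the file name "sample.txt" is only a placeholder for whatever
corpus is being measured):

  # Sketch: letter frequencies and inter-occurrence intervals for one letter.
  from collections import Counter

  text = open('sample.txt').read().lower()
  letters = [c for c in text if 'a' <= c <= 'z']

  # First-order statistics: relative frequency of each letter.
  freq = Counter(letters)
  total = len(letters)
  for letter, count in freq.most_common(5):
      print(letter, count / total)

  # Intervals between successive occurrences of 'e'.
  positions = [i for i, c in enumerate(letters) if c == 'e']
  intervals = [b - a for a, b in zip(positions, positions[1:])]
  print('mean interval for e:', sum(intervals) / len(intervals))

A histogram of those intervals, compared against exponential or Erlang
fits, would at least answer the distribution question empirically.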
(2) word frequency
Shannon also constructed 1st order, 2nd order, etc. Markov
approximations to English using words rather than characters as
the elements. How much variation should we expect from these
approximations in ordinary English?
Zipf's Law [WELSH, p. 97] states that the word frequency for a
language obeys the formula:
p(n) = A / n
where n is the rank of the word in frequency order and A is a
constant chosen so that:
SUM_n p(n) = 1
For example, in English, the most frequently used words are,
in order, "the", "of", "and", and "to". According to Zipf's Law,
the word "the" should be used about twice about as often as the
word "of" and about four times as often as the word "to".
Mandelbrot suggested a more complex formula:
p(n) = A / (n + V)^(1/D)
where V and D are independent parameters. I suppose that the
intelligence agencies have even more sophisticated models.
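Even a crude, unauthoritative check of a corpus against Zipf's Law is
straightforward (a sketch; "sample.txt" is again just a placeholder):

  # Sketch: compare observed word frequencies with Zipf's Law p(n) = A / n.
  from collections import Counter

  words = open('sample.txt').read().lower().split()
  counts = Counter(words)
  total = sum(counts.values())
  N = len(counts)                    # vocabulary size

  # Choose A so that SUM_n A/n = 1 over the N ranks.
  A = 1.0 / sum(1.0 / n for n in range(1, N + 1))

  for rank, (word, count) in enumerate(counts.most_common(10), start=1):
      observed = count / total
      predicted = A / rank
      print(rank, word, round(observed, 4), round(predicted, 4))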
(3) compressibility
According to [WELSH, p. 96], Shannon's experiments measured the
entropy of English (over a 26 letter alphabet plus a space) as
only 0.6 to 1.3 bits per character. Since normal English text
has upper and lower case letters, digits, and other characters,
perhaps a better value for normal English is about 2.5 bits per
character. (If so, then shouldn't compression programs be able
to achieve about a factor of 8 / 2.5 > 3 compression?)
Is "dense" writing less compressible than "fluff"? Apparently
so, since measurements of the redundancy of various English
texts [WELSH, p. 100] show significant differences.
Since well-encrypted messages are incompressible, will a message
that hides an encrypted message be less compressible than normal
English text?
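A crude stand-in for such a measurement is to run an off-the-shelf
compressor over the text and count bits per character; a sketch using
Python's zlib (which is, of course, a much weaker model of English
than the entropy figures above assume):

  # Sketch: estimate bits per character by compressing the text.
  import zlib

  data = open('sample.txt', 'rb').read()
  compressed = zlib.compress(data, 9)
  bits_per_char = 8.0 * len(compressed) / len(data)
  print('bits per character:', round(bits_per_char, 2))
  print('compression factor:', round(8.0 / bits_per_char, 2))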
(4) grammar, style, and readability
Grammar checkers can distinguish normal sentences from text such as:
"Distinguish normal can grammar checkers text sentences from."
that may satisfy other statistics for normal English text. But what
is an ordinary distribution of legal parses for English sentences?
Also, how does one allow for the different conventions in formal,
written English vs. conversational English vs. slang vs. email/USENET
netspeak vs. special sublanguages such as computer languages or
mathematics? Bear in mind that netspeak has several distinguishing
features. For example, email addresses of the form xxx@xxx.xxx.xxx,
quoted text with a ">" in column 1, and smilies are typical net
conventions. Mail headers and signatures (especially PGP signatures)
have a special structure, too.
Can a grammar checker help to distinguish normal text from text
that may have a hidden message? What useful clues might style checkers
and readability scores (Kincaid, Coleman-Liau, Flesch, etc.) give?
An interesting experiment would be to compare automated readability
scores with the compressibility of the text.
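The readability half of that experiment is easy to sketch; the Flesch
Reading Ease score needs only sentence, word, and syllable counts (the
vowel-run syllable counter below is admittedly a crude approximation):

  # Sketch: Flesch Reading Ease from crude sentence/word/syllable counts.
  import re

  text = open('sample.txt').read()
  sentences = max(1, len(re.findall(r'[.!?]+', text)))
  words = re.findall(r"[A-Za-z']+", text)
  syllables = sum(max(1, len(re.findall(r'[aeiouy]+', w.lower())))
                  for w in words)

  flesch = (206.835 - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))
  print('Flesch Reading Ease:', round(flesch, 1))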
(5) semantic continuity and logic
Do the sentences in a paragraph relate somehow to each other,
or are they separate, independent constructions? How can that
be measured automatically?
(6) message context
Does the content of the message look normal in its context?
(For example, a baseball play-by-play would look out-of-place
in sci.med.) How can that be measured automatically?
(7) obvious
Some people are known suspects, no matter how innocuous-looking
their messages are. All their messages are suspect.
(8) other measures
What other measures might be useful for detecting the likely
presence of a hidden message? The distribution of number
of words in a sentence? The distribution of number of sentences
(or words) in a paragraph?
What programs and/or databases are readily available for making
these measures?
Steganographic Capacity of English Text
If the public English text is N characters long, how long can a
perfectly hidden message within that public text be? I think that it
can be about N/10 characters long, for a steganographic capacity of 10%.
I will show two ways to hide information in the public text: (1) the
grammatical structure of the sentence and (2) the word choice in the
sentence. (These are not the only methods, but they may be the two
best methods.)
Do you recall back in school when you "diagrammed" sentences in your
English class? That was actually imposing a parenthesization on
the sentence. For example, the sentence:
The tall boy ate the big pie.
becomes:
(The (tall boy)) (ate (the (big pie)))
The number of possible parenthesizations of a sentence of N words is
related to the number of ways to match N pairs of parentheses. The
number of matchings is the Nth Catalan number:
X(N) = C(2N, N) / (N + 1) >= 2^(N-2)    [AHU, p. 73]
where C(2N, N) is the number of combinations of 2N objects, taken
N at a time, which is (2N)!/(N!^2). The number of parenthesizations
is the (N-1)st Catalan number, X(N-1). If all parenthesizations were
equally likely, then the parenthesization of a sentence of N words
would give more than (N-1)-2 = N-3 bits of information, or
1 - 3/N bits per word. (Of course, not all parenthesizations are
equally likely. But X(N) is also much larger than 2^(N-2), so
for now I'll assume that those two roughly cancel out.)
Since the average word length in English is about 4 characters
[WELSH, p. 101], or 5 characters counting a separating space,
and each ASCII character has 8 bits, we get a steganographic
efficiency of (1 - 3/N) / 40. (Notice that I am ignoring punctuation
in my count of characters in English text. Since this count is just
a rough approximation anyway, the effect of punctuation should get
lost in the noise.)
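A quick sanity check of these numbers (a sketch; the assumption that
all parenthesizations are equally likely is, again, the weak point):

  # Sketch: bits available from sentence parenthesization, assuming
  # (unrealistically) that all parenthesizations are equally likely.
  from math import comb, log2

  def catalan(n):
      return comb(2 * n, n) // (n + 1)

  for N in (5, 10, 20):                 # words per sentence
      bits = log2(catalan(N - 1))       # log2 of the (N-1)st Catalan number
      per_char = bits / (N * 5 * 8)     # 5 chars/word, 8 bits/char
      print(N, round(bits, 1), 'bits;', round(per_char, 3), 'per public char')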
Another way to hide information in the public text is with the choice
of words. Since English has a large vocabulary, I think that almost
always we can get one bit of information per word, just from the word
choice alone. (Unusual words should not be used often, though, since
normal English text does not use them often.) For example, we might
XOR all the bits of all the characters of the word and use its parity.
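A minimal sketch of that parity convention (the convention itself is
an arbitrary choice, not a recommendation):

  # Sketch: one hidden bit per word, taken as the parity of the XOR
  # of all character codes (equivalently, of all bits) in the word.
  from functools import reduce

  def word_bit(word):
      x = reduce(lambda acc, c: acc ^ ord(c), word, 0)
      return bin(x).count('1') % 2

  sentence = "The tall boy ate the big pie"
  print([word_bit(w) for w in sentence.split()])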
Can we get two bits per word? Probably most of the time. Suppose
that we try to get two bits per word from our word choice but succeed
only with probability p. The channel capacity of a binary symmetric
channel (BSC) is:
1 + p log p + (1-p) log (1-p)
which is:
1 - H(p)
By Shannon's noisy-channel coding theorem, we should be able to achieve
an error correcting coding that approaches this capacity. (Use of
that encoding unfortunately may alter the statistics of the hidden
message sufficiently to expose the use of steganography, however.)
For what values of p will it be worthwhile to insert an uncertain
two bits per word rather than a (nearly) certain one bit? Since
H(0.11) = 0.5 (approximately), p had better be .89 or higher. If p
is .95, then H(p) = .29 (approximately), giving 1.4 bits / word rather
than just 1 bit / word. I doubt that we can get better than 1.4 bits
per word with this method and still have normal looking English, though,
because of Zipf's Law. The normal frequencies of the four words
"the", "of", "and", and "to" are high, totalling at least 10%, so
the public text has to include many of them, whether we want their
particular parity bit patterns or not. We can improve the efficiency
by attempting two bits of information only for the long words and
attempting only one bit for the short words. Maybe we should attempt
to achieve |K/5| bits for words of K characters, where "|x|" means "x
rounded down to the nearest integer". Or maybe we should not try to hide
any bits at all in the extremely short words. I don't have enough
information about typical English to analyze that.
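The break-even arithmetic above is easy to reproduce (a sketch):

  # Sketch: bits per word from a risky two-bit-per-word attempt that
  # succeeds with probability p, modeled as a BSC of capacity 1 - H(p).
  from math import log2

  def H(p):
      if p in (0.0, 1.0):
          return 0.0
      return -p * log2(p) - (1 - p) * log2(1 - p)

  for p in (0.89, 0.95, 0.99):
      print(p, round(2 * (1 - H(p)), 2), 'bits per word')

The two-bit attempt beats a certain one bit per word only once p
exceeds roughly 0.89.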
What is the total steganographic efficiency we achieve by exploiting
both the grammatical structure and the word choice? My estimates total:
( (1 - 3/N) + 1.4 ) / 40 = 0.06 - 3/(40N)
Just to get a number, let's assume that N = 10 words per sentence.
That gives us 0.0525, which I'll round down to 0.05. That actually
gives us much better than 5%, though, because the hidden message is
first compressed and then encrypted. If compression halves the length
of the hidden message, we get effectively a 10% efficiency for the
steganographic capacity of English. This estimate will decrease by
whatever amount typical English parenthesization departs from uniform
over all possibilities but it will increase by improved exploitation
of word choice and, especially, by improved compression. Of course,
the effectiveness of this camouflage depends on the sophistication
of one's model of English text. Perhaps normal English has enough
variation that a good, but not perfect, model of English will yield
public text that is indistinguishable from normal text, even to the
more resourceful eavesdroppers.
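Spelling out the final arithmetic (a sketch; the 2:1 compression
figure is just the assumption used above):

  # Sketch: combined steganographic efficiency estimate.
  N = 10                                  # words per sentence (rough guess)
  bits_per_word = (1 - 3.0 / N) + 1.4     # parenthesization + word choice
  efficiency = bits_per_word / 40         # 40 public bits per word
  print('raw efficiency:      ', efficiency)        # about 0.0525
  print('with 2:1 compression:', 2 * efficiency)    # about 10 percent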
Kevin Q. Brown
INTERNET kqb@whscad1.att.com
or kevin_q_brown@att.com
AHU - The Design and Analysis of Computer Algorithms,
      Aho, Hopcroft, and Ullman, Addison-Wesley, 1974.
CRYPTO83 - "The Prisoners' Problem and the Subliminal Channel",
      Gustavus J. Simmons, Advances in Cryptology: Proceedings of
      Crypto 83, Plenum Press, 1984.
KAHN - The Codebreakers, David Kahn, Macmillan, 1967.
WELSH - Codes and Cryptography, Dominic Welsh, Clarendon Press, 1988.