On Thu, 11 Apr 1996, Timothy C. May wrote:
That there can be no simple definition of entropy, or randomness, for an arbitrary set of things, is essentially equivalent to Godel's Theorem.
I think what you mean is that there is no simple way to measure randomness. Simple, clean definitions of randomness do exist. In fact there are two closely related definitions. The first is entropy, which is a measure of the unpredictability of a random variable. (A random variable is a set of values with a probability assigned to each value; in this case the values are bit strings.) The second is algorithmic complexity, which is a measure of the incompressibility of a bit string. Notice that it doesn't make sense to talk about the entropy of a string or the algorithmic complexity of a random variable.

Unfortunately both of these quantities are very difficult to measure. Algorithmic complexity is provably uncomputable: given a string, you can never tell when you have found its best compression. Entropy can in principle be determined, but in practice it's hard because you must have a good probability model of the mechanism that generates the random variable.

What we want to do is calculate the entropy of the output of a physical random number generator. If we have a probability model of the rng, we're home free. For example, if the rng tosses a fair coin 64 times, it's easy to calculate that the entropy is 64 bits. But what if the rng is too complex to be easily modeled (for example, a human being pounding on the keyboard)?

Algorithmic information theory says that the entropy of a random variable equals its expected algorithmic complexity (up to an additive constant). So if we could calculate algorithmic complexity, we could estimate the entropy by sampling the output of the rng many times, calculating the algorithmic complexity of each sample, and taking the average as the estimated entropy. Unfortunately, we already know that algorithmic complexity is NOT computable. The best we can do, and what is apparently already done in practice, is to find an upper bound (call it x) on the algorithmic complexity of a string by trying various compression schemes, divide that number by a constant safety factor (say 10), and use x/10 as a conservative estimate of the algorithmic complexity.

(Tim, I know you already understand all this, but your earlier explanation wasn't very clear. I hope this helps those who are still confused.)
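To make the entropy calculation concrete, here is a little Python sketch (my own illustration, not anything from Tim's post; the function name is mine). It computes the entropy of a random variable from an explicit probability model, and checks the fair-coin example:

import math

def shannon_entropy(probs):
    # Entropy in bits: H = -sum(p * log2(p)) over all values with p > 0.
    # probs lists the probability of each possible value; they must be
    # nonnegative and sum to 1.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin has entropy log2(2) = 1 bit per toss, and entropy is
# additive over independent tosses, so 64 tosses give 64 bits.  We
# compute it this way rather than enumerating all 2^64 outcomes.
print(64 * shannon_entropy([0.5, 0.5]))   # -> 64.0

The model requirement is visible right in the signature: you can't call this function without already knowing the probabilities, which is exactly what we lack for something like keyboard pounding.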
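And here is a sketch of the conservative compression-based estimate, under the same caveat that it's my own illustration: zlib stands in for "various compression schemes", and the safety factor of 10 is just the example figure from above, not a principled constant.

import zlib

def conservative_entropy_estimate(samples, safety_factor=10):
    # For each sample (a byte string), the length of its zlib-compressed
    # form is an upper bound x on its algorithmic complexity.  x divided
    # by the safety factor is a deliberately conservative per-sample
    # figure; averaging over many samples estimates the rng's entropy.
    estimates = []
    for s in samples:
        x = 8 * len(zlib.compress(s, 9))   # upper bound, in bits
        estimates.append(x / safety_factor)
    return sum(estimates) / len(estimates)

# Usage: feed it many independent samples of the rng's output, e.g.
# keystroke timings packed into byte strings:
#   print(conservative_entropy_estimate(samples))

Note that the compressed length can only overestimate the true algorithmic complexity (real compressors add headers and miss regularities), so the division by the safety factor errs in the safe direction.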
(To forestall charges that I am relying on an all-too-common form of bullshitting, by referring to Godel, what I mean is that "randomness" is best defined in terms of algorithmic information theory, a la Kolmogorov and Chaitin, and explored in Li and Vitanyi's excellent textbook, "Algorithmic Information Theory and its Applications.")
A year ago, you recommended to me a book by the same authors titled _An Introduction to Kolmogorov Complexity and Its Applications_. Have the authors written a new book, or are these the same?

Wei Dai