Random Data Compressed 100:1 (Guffaw)

Eric Cordian emc at artifact.psychedelic.net
Tue Jan 8 15:20:49 PST 2002


Michael Motyka wrote:

> What exactly is random data? Does it have to appear to be random? 

> Does it have to pass some set of statistical tests to be random? If a
> string or bits from a radiation source spells "Merry Christmas and a
> Happy New Year" in ASCII is it non-random - a message from above?

Randomness is generally defined for sequences, which are mappings from the
non-negative integers to some finite set of symbols.  A sequence is
random if its values are independently selected and uniformly distributed.

A random sequence of ASCII characters will not only spell out "Merry
Christmas and a Happy New Year" numerous times, but will spell out every
other possible piece of text as well.  In a random sequence of bits, all
2^N N-bit strings are equally likely to occur as we step through the
sequence, for any chosen value of N.
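This is easy to check numerically.  A small sketch (using Python's
standard random module, with a fixed seed so the run is repeatable)
tallies every 3-bit window of a long pseudorandom bit sequence; all
2^3 = 8 patterns come out with roughly equal frequency:

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the run is repeatable

# Generate a long pseudorandom bit sequence and tally every 3-bit window.
bits = [random.randint(0, 1) for _ in range(300_000)]
N = 3
counts = Counter(tuple(bits[i:i + N]) for i in range(len(bits) - N + 1))

# All 2^3 = 8 patterns should occur with frequency close to 1/8.
total = sum(counts.values())
for pattern, c in sorted(counts.items()):
    print(pattern, round(c / total, 3))
```

The same tally with larger N (and proportionally more bits) shows the
same flat distribution over all 2^N patterns.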

The Theory of Compression works with sequences.  Finite sets of data, or
strings, are considered to be initial segments of sequences.

There is also the notion of the Kolmogorov complexity of a string, which
is the length of the shortest program which generates it.  This also leads
to the Kolmogorov definition of a random string, as one whose shortest
description is itself.
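Kolmogorov complexity itself is uncomputable, but the output length of
any general-purpose compressor is a crude, computable upper bound on
it.  A sketch using Python's zlib, contrasting a highly regular string
(which has a short description: "repeat 'ab' 5000 times") with random
bytes (which do not):

```python
import os
import zlib

# zlib's output length is a rough, computable upper bound on the true
# (uncomputable) Kolmogorov complexity of a string.
regular = b"ab" * 5000          # short description, low complexity
random_bytes = os.urandom(10_000)  # no short description

print(len(zlib.compress(regular)))       # a few dozen bytes
print(len(zlib.compress(random_bytes)))  # about 10,000 bytes, or slightly more
```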

Most practical compression algorithms, however, do not work with the
complexity of data in the Kolmogorov sense, but merely compress a sequence
using a fixed amount of its previous history as a dictionary.
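A toy sketch of that idea, in the style of LZ77 (the names and the
greedy matching strategy here are my own simplification, not any
particular production algorithm): each position either emits a literal
byte or a back-reference into a fixed window of previous history.

```python
def lz77_tokens(data: bytes, window: int = 4096):
    """Greedy LZ77-style tokenizer: emit ("ref", offset, length) references
    into a fixed window of previous history, or ("lit", byte) literals."""
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        # Search the history window for the longest match starting at i.
        for j in range(max(0, i - window), i):
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= 3:  # references shorter than 3 bytes don't pay off
            out.append(("ref", best_off, best_len))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

tokens = lz77_tokens(b"abcabcabcabc")
print(tokens)  # three literals, then one reference covering the repeats
```

Note that the reference may overlap the current position, which is how
a run of period 3 is covered by a single (offset 3, length 9) token.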

As a simple example of how a good compression algorithm might work,
consider a person sitting in front of a console that displays the first
N-1 characters of some text.  The person tries to guess character N and
makes successive guesses until the computer tells him he is successful.  
If we entropy code the number of guesses at each point, and we presume
different persons use the same deterministic algorithm for guessing based
on a knowledge of the English language, and on what has previously occurred
in the text, we can transform any English text of more than trivial length
into a lesser number of bits than are required to represent it in ASCII.
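A toy sketch of the scheme.  The predictor here (rank characters by
their frequency in the text seen so far) is my own simplistic stand-in
for a real model of English; the point is only that encoder and decoder
share the same deterministic guesser, so the sequence of guess ranks is
enough to reconstruct the text, and for real text those ranks are
mostly small numbers that an entropy coder can pack tightly.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def guess_order(history: str) -> list:
    # Shared deterministic guesser: most frequent characters in the
    # history first; ties (and unseen characters) in character order.
    freq = Counter(history)
    return sorted(ALPHABET, key=lambda c: (-freq[c], c))

def encode(text: str) -> list:
    # For each position, record the rank of the true character in the
    # guesser's ordering (i.e. number of guesses minus one).
    return [guess_order(text[:i]).index(ch) for i, ch in enumerate(text)]

def decode(ranks: list) -> str:
    # Replay the same guesser to turn ranks back into characters.
    text = ""
    for r in ranks:
        text += guess_order(text)[r]
    return text

msg = "the theory of compression works with sequences"
ranks = encode(msg)
print(decode(ranks) == msg)   # the round trip is exact
print(sum(ranks) / len(ranks))  # average rank is small for real text
```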

If we imagine such a scheme applied to the random binary sequence
described earlier, the person will on the average require 1 guess 50% of
the time, and 2 guesses the other 50% of the time to correctly identify
the next bit.  Entropy coding two symbols which occur with equal frequency
requires one bit for each occurrence.  In this case, attempting
"compression" requires one bit output for each bit input, demonstrating
that a random sequence cannot be compressed.
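The arithmetic behind that last step: the number of guesses is a
two-symbol source with probabilities 1/2 and 1/2, and its Shannon
entropy is the best achievable average code length.

```python
from math import log2

# Guess counts in the binary guessing game: 1 guess with probability 1/2,
# 2 guesses with probability 1/2.  The entropy of this two-symbol source
# is the minimum average number of output bits per input bit.
p = {1: 0.5, 2: 0.5}
H = -sum(q * log2(q) for q in p.values())
print(H)  # 1.0 bit of output per bit of input: no compression
```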

Talking about a finite string being "random" in the commonly accepted
non-Kolmogorov sense is a meaningless notion, apart from the string being
the initial segment of some random sequence.  Any random string is produced
by taking a finite number of instances of some random variable, and such a
random variable has, by definition, an infinite sequence of such instances
over which its mathematical properties are defined.

-- 
Eric Michael Cordian 0+
O:.T:.O:. Mathematical Munitions Division
"Do What Thou Wilt Shall Be The Whole Of The Law"




