Randomness of a bit string
Here's a short article I wrote for sci.crypt aboout "randomness" of a bit string and the Kolmogorov-Chaitin definition that a string is random if and only if it has no shorter description than itself. This has some fascinating tie-ins to "cryptoregular" strings, which are strings which appear to be "regular" (a variant of randomness, meaning all digits are equally represented...high entropy) but which, with the right transformation, suddenly lose their regularity. (For you practical engineering folks, noise sources and other physical randomness sources will in most cases be enough, even if the randomness can never be "proved.") --Tim May Newsgroups: sci.crypt From: tcmay@netcom.com (Timothy C. May) Subject: Re: Randomness of a bit string Message-ID: <tcmayCK5CtF.23H@netcom.com> Date: Mon, 24 Jan 1994 18:32:03 GMT Bruce Grant (bgrant@umcc.umcc.umich.edu) wrote: : The usefulness of a one-time pad seems to hinge on whether the sequence : of key bits is really random. Could someone post a short, not too : technical definition of randomness of a bit string? In particular, is : this a mathematical property, or just a general measure of whether the : string is "predictable"? Does it depend on the nature of the cryptanalyst : or only on the string of bits? (In other words, if the key is based on : an Albanian translation of "Mary had a little lamb" is it random if you : don't know Albanian?) : Could a program test a key for randomness, or is this meaningless? A fascinating question! The answer lies at the heart of what we mean by randomness, complexity, predictability, regularity, and falls into the field of Kolmogorov-Chaitin complexity, or algorithmic information theory. Also called "descriptive complexity." Basic definition: A random string has no shorter description than itself. That is, it is incompressible. (Practically, we know "random strings" won't compress much...sometimes a compressor will shorten them, sometimes it will lengthen them. The notion above, that random strings will not compress, is very general and applies in the limit, not for some particular instance of a string--and some particular instance, e.g., "1 0 0 0 1 1 0" will of course have a good chance of having some particular compressions, some short description.) One consequence is "regularity": all digits of a base will be equally represented in the limit. Another consequence, as noted in one of the other followups to this question, is unpredictability of the next element or bit in a sequence. (Predictability of bits would imply a compression.) Cryptography is an interesting situtation. Charles Bennett talks about "cryptoregular" strings in a paper in the "Physics of Computation" Proceedings (1992, IEEE Press). A cryptoregular string _appears_ to have high entropy ("maximum randomness") and regularity (all symbols equally represented), and thus to be "random." But application of the _key_ will show the string is actually low entropy ("Mary had a little lamb, it's fleece was white as snow...") and is very compressible (the name of the song is the compressed version, for example). Good cryptography means cryptoregular strings. A fascinating discovery by Chaitin and others (Kolmogorov, Solomnoff, Martin-Lof, Levin all worked in this area) is that one can never prove a given sequence or string is "random." As in some diabolically clever IQ test, an apparently random sequence may have some shorter description, or compression, that means it does not fit this definition of randomness. Having said this, it is clear that for practical purposes, many sources used to generate "random numbers," e.g., noise diodes, alpha particles, tosses of a coin, etc., are "effectively random" (don't ask me to define this!) in that no compression/prediction will ever be done, though we can never be absolutely certain one does not exist! A nice book on this stuff just came out: "An Introduction to Kolmogorov Complexity and Its Applications," by Li and Vitanyi, 1993, Springer-Verlag. Cryptography per se is not mentioned (a disappointing lapse), but the ideas are widely applicable. --Tim May -- .......................................................................... Timothy C. May | Crypto Anarchy: encryption, digital money, tcmay@netcom.com | anonymous networks, digital pseudonyms, zero 408-688-5409 | knowledge, reputations, information markets, W.A.S.T.E.: Aptos, CA | black markets, collapse of governments. Higher Power:2**859433 | Public Key: PGP and MailSafe available.
There seems to be a misinterpretation of the point I was making about randomness and how no number (or sequence) can be _proved_ to be random.
This has some fascinating tie-ins to "cryptoregular" strings, which are strings which appear to be "regular" (a variant of randomness, meaning all digits are equally represented...high entropy) but which, with the right transformation, suddenly lose their regularity. ... Basic definition: A random string has no shorter description than itself. That is, it is incompressible. ...
A fascinating discovery by Chaitin and others (Kolmogorov, Solomnoff, Martin-Lof, Levin all worked in this area) is that one can never prove a given sequence or string is "random." As in some diabolically clever IQ test, an apparently random sequence may have some shorter description, or compression, that means it does not fit this definition of randomness.
The point here is for a number or sequence which is _given_, just presented, as in: "Is the sequence 100010001010110010101 random?" Or, "Is the number 9045886804 random?" Variants of this question come up all the time, as in predicting the next term of a sequence, trying to determine if a sequence of characters is likely to be just noise or is instead likely to be a message, and in issues of whether data is maximally compressed or can be compressed still further. These numbers are "found objects" in the sense that one generally has no idea what "model" or "theory" generated them. Someone looking at the first example, 100010001010110010101, might subject it to all kinds of tests: -visual inspection to see if it's some "obvious" number (such as "1010101010101010" would be, or "01011101110111" might be) -statistical tests, to see if it deviates "significantly" from the expected pattern of random numbers (regular distribution of digits, of pairs, triples, quadruples, etc.). The usual arsenal of entropy measurements, chi-square tests, null hypothesis testing, etc. -other tests to see if the number is related to other known numbers, which could be things like the day of the year, the digits of pi, the phone number of whoever generated the number, etc. -other tests and guesses that cryptanalysts and puzzle-solvers are familiar with A plausible result for someone to announce, after such a series of tests, is "I can't find any patterns, and the distribution of digits falls within expected ranges. We've compared the number against the suspect's various numbers and can find no linkages. It looks pretty random to us." (By "random" he essentially means "like the result of a sequence of coin tosses." Fair coin, of course.) But can he ever say "I can prove the number is random"? No. There's always some chance an even-cleverer puzzle solver will find the pattern, the key that unlocks the randomness. For example, most ciphertexts pass nearly all statistical tests for randomness, "look" random, and even _act_ like random numbers (recall the Blum-Blum-Shub pseudorandom number generator and how good it is). But simple application of the key turns the seemingly random "100010001010110010101" into "ATTACK." Let's look at the second example. Is the number "9045886804" random or not? And can we _prove_ it's random? (If you're worred that these numbers are somehow too small, don't worry. The same reasoning applies to any number or sequence one might encounter, including short numbers and multi-page numbers or sequences (such as PGP might generate)). The cryptanalyst or problem-solver looks for the patterns, the statistical distributions and entropies, and _any other_ links he can think of. That is, his "models" for the generator of this number are not known to him, but he may make some guesses based on the owner of the number, the score in the SuperBowl, the age of Bill Clinton, etc. That is, he'll look to see if the number is some sort of simple cipher or transpostion based on one of the "unrandom" numbers around him. To cut to the chase, can he ever "prove" the number is random? Can he even claim that the generator of the number "must have" used a process that is commonly used to generate numbers with a good approximation to a random process (flippin coins, alpha counts, etc.)? Suppose he declares to his boss, Admiral Inman, that he has "proved" the number is "random." Inman says to him: "This post was written by this trouble-maker Tim May, who even gives his phone number in every post he writes. What happens if we reverse the digits of his number? 408-688-5409 turns into 9045886804! Some "random" number! Clean out your desk tonight." Now is it kosher to take the "theory" of my phone number and allow it to be included in the analysis of wheter a number is random or not? Of course it is! In the real world, this is what we mean by randomness and predictabilty, whether we can find patterns and structure. And this is what cryptanalysts really do, and what good password-guessing programs do: they take account owner information such as name, spouse's name, pet's name, birthdate, and any other information they can scrounge about an account owner and then run permuations and hope for the best. Some percentage of the time, the passwords are "guessed," meaning that they were not very random at all. (This was the point I was making about famous numbers (like "729"), paradoxes (there are no "uninteresting" numbers, because the smallest "uninteresting" number is automatically interesting, and in fact is has a short description), and the number listed in Chaitin's book. I hope this explanation here makes it a bit clearer.) In this real world of trying to break cyphers, all is fair. All models may be considered, though not all models can be (e.g., one would not try applying the phone number of Chester Umbizi in Nairobi, Kenya at random!). No number can be proved to have no shorter description than itself. And as various shorter descriptions are found, with whatevr effort it takes, it cannot be proved that the description is the shortest that will ever be found. It may be strongly susepected that no shorter description exists. In fact, most numbers are incompressible, but a simple counting argument, in any theory. (For example, of the 100-binary-digits, not many of them have 50-digit compressions, and even fewer have 10-digit descriptions. Work out the numbers.) So, if someone tells you they've "proved" a particular number is random, just smile. --Tim May -- -- .......................................................................... Timothy C. May | Crypto Anarchy: encryption, digital money, tcmay@netcom.com | anonymous networks, digital pseudonyms, zero 408-688-5409 | knowledge, reputations, information markets, W.A.S.T.E.: Aptos, CA | black markets, collapse of governments. Higher Power:2**859433 | Public Key: PGP and MailSafe available.
Thanks to Tim May for his *excellent* tutorial on randomness, which can be compressed into a single sentence: "Randomness is in the eye of the beholder." :-) Phil
Tim May writes:
A fascinating discovery by Chaitin and others (Kolmogorov, Solomnoff, Martin-Lof, Levin all worked in this area) is that one can never prove a given sequence or string is "random."
I believe this is overstating the case. The only theorem along these lines that I saw in Li and Vitanyi's book was that, for any logical theory, there are at most a FINITE number of strings that can be proven random. The upper bound on the number of strings that can be proven random is quite large, by the way -- it's larger than 2^n, where n is the minimum number of bits needed to represent the logical theory. Thus, although no algorithm can tell you, for all strings x, whether or not x is random, it may be possible to prove a few particular strings random (with respect to a given encoding of algorithms). ----------------------------------------------------------------------------- Kevin S. Van Horn | It is the means that determine the ends. kevin@bert.cs.byu.edu |
Kevin Van Horn writes:
Tim May writes:
A fascinating discovery by Chaitin and others (Kolmogorov, Solomnoff, Martin-Lof, Levin all worked in this area) is that one can never prove a given sequence or string is "random."
I believe this is overstating the case. The only theorem along these lines that I saw in Li and Vitanyi's book was that, for any logical theory, there are at most a FINITE number of strings that can be proven random. The upper bound on the number of strings that can be proven random is quite large, by the way -- it's larger than 2^n, where n is the minimum number of bits needed to represent the logical theory. Thus, although no algorithm can tell you, for all strings x, whether or not x is random, it may be possible to prove a few particular strings random (with respect to a given encoding of algorithms).
I don't believe this is overstating the case at all. To quote Gregory Chaitin, from a context I cannot do justice here: "...leads to the demonstration that a specific number cannot be proved random." ("Information, Randomness, and Incompleteness: Papers on Algortithmic Information Theory," Second Edition, 1993) To see this another way, suppose an algorithm existed to always know if a given number is "random" or not. Then application of this algorithm to the natural numbers would presumably find the "smallest random number," such as "729." (An inside joke.) But this smallest random number would itself be intensely interesting and hardly random. And so on, a la the Berry Paradox and other well-know cousins of Godel's Theorem. If someone claims they can "prove" the sequence "0 1101100110111100010" is really random, ask them _how_. Ask them if the compression "Chaitin 27," meaning the example number given on page 27 of Chaitin's book is not that same number, making it hardly random. (Is it cheating to invoke other systems, books, etc. in the definition? Hardly. Cryptographers do it all the time. The mass of planet motion observation data certainly _looked_ random to ancient astronomers, until Kepler found his amazing compression of the data.) There is a mass of stuff here, and much room for us all getting tangled up in what randomness really means, what algorithms are, formal definitions (with reference to Turing machines and whether they halt or not, etc.), and so on. I urge interested readers to read Chaitin's papers, which are focused on issues of randomness, and also the Li and Vitanyi book. I stand by my point that no number or sequence can be proved to be random. --Tim May -- .......................................................................... Timothy C. May | Crypto Anarchy: encryption, digital money, tcmay@netcom.com | anonymous networks, digital pseudonyms, zero 408-688-5409 | knowledge, reputations, information markets, W.A.S.T.E.: Aptos, CA | black markets, collapse of governments. Higher Power:2**859433 | Public Key: PGP and MailSafe available.
Timothy C. May says:
If someone claims they can "prove" the sequence "0 1101100110111100010" is really random, ask them _how_. Ask them if the compression "Chaitin 27," meaning the example number given on page 27 of Chaitin's book is not that same number, making it hardly random.
(Is it cheating to invoke other systems, books, etc. in the definition? Hardly.
Wrong, Tim. An algorithm must be self contained. If you have to refer to Chaitin's book in the algorithm, you must include it in the algorithm. For a proof, consider the following notion: you have a large number that you THINK is incompressable. Write it down in the "little book o' random numbers", now refer to it as the third number in the book. Obviously, of course, this is bullshit -- if you transmitted it to someone that way you would have to send the book, too. This is unlike your earlier (correct) proof that you can't show a number is random because where there an algorithm you could order the random numbers and the first would no longer be random, because the algorithm *is* self contained in that case.
The mass of planet motion observation data certainly _looked_ random to ancient astronomers, until Kepler found his amazing compression of the data.)
Its correct that Kepler compressed the string, but incorrect to note that having written the numbers in a book had anything to do with it. Perry
From: tcmay@netcom.com (Timothy C. May) I don't believe this is overstating the case at all. To quote Gregory Chaitin, from a context I cannot do justice here: "...leads to the demonstration that a specific number cannot be proved random."
Perhaps the context is relevant. Chaitin's `omega', for example, is Kolmogorov random (too bad!). (Omega is the sum over all x of m(x), where m(x) is the Solomonoff-Levin distribution.)
To see this another way, suppose an algorithm existed to always know if a given number is "random" or not. Then application of this algorithm to the natural numbers would presumably find the "smallest random number," such as "729." (An inside joke.) But this smallest random number would itself be intensely interesting and hardly random.
This is an informal argument, using an informal definition of randomness. Presumably in this discussion we could standardize on Kolmogorov randomness, to which definition Berry's paradox does not apply.
--Tim May
Eli ebrandt@jarthur.claremont.edu
Continuing the discussion on whether there may exist a few random strings that can be proven random... Tim May writes:
To see this another way, suppose an algorithm existed to always know if a given number is "random" or not. [Paradoxes follow]
But that's not what I was talking about; I specifically acknowledged that there was no such algorithm that ALWAYS gives you the answer. But even in the absence of a general algorithm to decide a problem, it may be possible to decide some specific instances. For example, a basic result of computability theory is that there is no algorithm that will, for any program P and input x, tell you if P eventually halts on input x. Yet there are many SPECIFIC instances of programs P and inputs x for which it has been proven that P halts on input x; this is what the whole business of formal proofs of program correctness is about.
If someone claims they can "prove" the sequence "0 1101100110111100010" is really random, ask them _how_. Ask them if the compression "Chaitin 27," meaning the example number given on page 27 of Chaitin's book is not that same number, making it hardly random.
This argument is invalid. To see why, let's review the definition of a random string. Randomness is defined in terms of Kolmogorov complexity, which is defined relative to any universal function U. (A universal function U takes as input an encoding of a Turing machine T, together with its input z; its output is undefined if T does not halt on input z, otherwise its output is the value T outputs on input z. Each different effective encoding of program-input pairs defines a different universal function.) The Kolmogorov complexity C_U(x) of a string x (relative to U) is defined to be the length of the shortest string y such that U(y) is defined and U(y) = x. In a sense, it doesn't matter which universal function you use, since it turns out that for any two universal functions U and V there exist constants c1 and c2 such that C_U(x) <= C_V(x) + c1 for all x, and C_V(x) <= C_U(x) + c2 for all x. A string x is defined to be random (w.r.t. U) if C_U(x) >= x. Trivially then, the empty string is a random string. Also, Tim's example is meaningless, since it does not give an algorithm. (Caveat: you COULD construct a universal function U that has Chaitin's book built in to it, but it is certainly NOT the case that every universal function has this property.) To prove that a nonempty string x is nonempty, it suffices to prove that for all strings y shorter than x, either U(y) is undefined or U(y) != x. This amounts to proving the output (and halting behavior) of a finite number of program-input pairs. For some strings x and universal functions U this task may be absolutely trivial. Consider a Turing machine T that always halts and always outputs the empty string, regardless of its input. Let z_1,...,z_m be m arbitrary strings, where m exceeds the number of strings shorter than x. It is straightforward to construct an effective encoding of program-input pairs for which (T,z_i) is encoded as the i-th bit-string in lexicographic order. Suppose that U is the corresponding universal function, and let y_i be the encoding of the program-input pair (T,z_i). Then U(y_i) is the empty string, for all 1 <= i <= m. Since the set { y_i : 1 <= i <= m } includes every string shorter than x, and x is nonempty, we then see that x is random (relative to U.) ----------------------------------------------------------------------------- Kevin S. Van Horn | It is the means that determine the ends. kevin@bert.cs.byu.edu |
Timothy C. May says:
Here's a short article I wrote for sci.crypt aboout "randomness" of a bit string and the Kolmogorov-Chaitin definition that a string is random if and only if it has no shorter description than itself.
With respect, Tim, this definition is insufficient. For cryptographic purposes, a string must not merely be incompressible but also unknown. One can imagine things that are uncontrollable and incompressable but well known -- such as, say the least signifcant bits in the payoffs on winning horses at some race track. Perry
participants (5)
-
Eli Brandt -
kevin@axon.cs.byu.edu -
Perry E. Metzger -
Phil Karn -
tcmay@netcom.com