Re: Random Data Compressed 100:1 (Guffaw)
Ken Brown <k.brown@ccs.bbk.ac.uk> wrote :
Michael Motyka wrote:
Here we go :
"As Eric correctly points out, true random input simply cannot be compressed, it doesn't matter how clever you are or how much time and computing power you have access to."
This is a statement of belief isn't it? Odd.
No, it's a logical consequence of his definition of "random" & no more a statement of belief than "1+1=2" is. If you use the word to mean something else then it might or might not be true in your universe.
Ken
Perhaps my universe is off in all six degrees but answer this - which is it :
random data cannot be compressed
or
random data may include recognizable and even repeating patterns
A string of random data may include anything. Compressibility will vary with the particular data and the algorithm used. Lucky with one input set, unlucky with another. I wouldn't try to base a business on getting lucky with the customer's input data but the absolute statement of incompressibility is wrong unless you preselect your data sets. Then they're no longer random and the definition of random becomes circular.
Mike
On 9 Jan 2002, at 10:24, Michael Motyka wrote:
Perhaps my universe is off in all six degrees but answer this - which is it :
random data cannot be compressed
or
random data may include recognizable and even repeating patterns
Both. I'll try this one last time and then give up on you. Any algorithm which maps files onto files will be AT LEAST as likely to increase the size of files as to decrease their size if fed random data. Further, the chance of being able to significantly decrease the size of a file derived from a random source with any given algorithm is vanishingly small (since there are 256 times as many N byte files as there are N-1 byte files, the odds of a random file being compressible by a single byte by a given algorithm can't be any better than 1 in 256. The odds of being able to compress it by two bytes are even worse).
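[To make the counting concrete, here is a short Python sketch. It is illustrative only; the variable names are mine, and the only inputs are the numbers George cites.]

    # George's counting argument, made concrete: there are 256 times as
    # many N-byte files as (N-1)-byte files, so by the pigeonhole
    # principle a lossless compressor can shorten almost none of them.
    N = 4  # file length in bytes; small, so the numbers stay readable

    n_byte_files = 256 ** N
    shorter_files = sum(256 ** k for k in range(N))  # lengths 0..N-1

    # Distinct inputs must map to distinct outputs, so at most
    # 'shorter_files' of the 'n_byte_files' inputs can shrink at all.
    print("files of exactly", N, "bytes :", n_byte_files)
    print("files shorter than", N, "bytes:", shorter_files)
    print("best-case compressible fraction: %.6f"
          % (shorter_files / n_byte_files))
    # prints ~0.003922, just under 1 in 255 -- George's "1 in 256",
    # even counting every possible shorter length down to zero bytes.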
A string of random data may include anything. Compressibility will vary with the particular data and the algorithm used. Lucky with one input set, unlucky with another.
No. Unlucky with virtually all of them. There aren't words to describe how unlikely it is that a 1 megabyte file of random data could be compressed by a factor of 100 by any given algorithm. It's not like winning the lottery. It's not even like winning the lottery 100 times in a row. We're talking vastly improbable.
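[A back-of-envelope calculation, again just a sketch and not from the original post, shows what "vastly improbable" means for the 100:1 claim:]

    # Upper bound on the fraction of 1 MB random files that ANY fixed
    # lossless algorithm could compress 100:1.
    from math import log10

    input_bits = 8 * 2 ** 20            # 1 MiB = 8,388,608 bits
    output_bits = input_bits // 100     # the 100:1 target size

    # Fewer than 2**(output_bits + 1) outputs exist at or below the
    # target size, so at most that many inputs can ever reach it:
    log2_fraction = (output_bits + 1) - input_bits
    print("fraction <= 2^%d  ~  10^%d"
          % (log2_fraction, round(log2_fraction * log10(2))))
    # ~10^-2,500,000.  A lottery win is ~10^-8; a hundred consecutive
    # wins ~10^-800.  The comparison above is, if anything, generous.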
I wouldn't try to base a business on getting lucky with the customer's input data but the absolute statement of incompressibility is wrong unless you preselect your data sets. Then they're no longer random and the definition of random becomes circular.
Mike
George
On Wednesday, January 9, 2002, at 09:24 AM, Michael Motyka wrote:
Ken Brown <k.brown@ccs.bbk.ac.uk> wrote :
Michael Motyka wrote:
Here we go :
"As Eric correctly points out, true random input simply cannot be compressed, it doesn't matter how clever you are or how much time and computing power you have access to."
This is a statement of belief isn't it? Odd.
No, it's a logical consequence of his definition of "random" & no more a statement of belief than "1+1=2" is. If you use the word to mean something else then it might or might not be true in your universe.
Ken
Perhaps my universe is off in all six degrees but answer this - which is it :
random data cannot be compressed
First, there is no such thing as "a set of random data" (which is what I take you to mean by "random data"). There are operational definitions of "a set of data generated by a process which appears to be random." Clicks of a Geiger counter, thermal noise from a reverse-biased diode, etc. (If my words "appears to be random" are weaselly, it's the nature of the beast. It is generally not possible to prove either that a source _is_ random or that a set of data is random.) Of the length-N strings which we would all call random, the majority have no description shorter than themselves.
or
random data may include recognizable and even repeating patterns
Of course any particular string will have clumps and small patterns, etc. This doesn't make them compressible, however. Run-length encoding on a "random" data set or string doesn't yield any compression, on average. (For a different slant on why this is so, intuitively, it's related to why strategies for "doubling up on winning bets" fail. Theory of martingales.)
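[A quick simulation makes this point. The toy run-length encoder below is my own illustration (1 bit for a run's value plus 8 bits for its length), not anything proposed in the thread:]

    # Naive run-length encoding of random bits EXPANDS the data,
    # because runs of fair coin flips average only 2 bits long.
    import random

    def rle_cost_bits(bits):
        cost, i = 0, 0
        while i < len(bits):
            j = i
            while j < len(bits) and bits[j] == bits[i] and j - i < 255:
                j += 1          # extend the current run (capped at 255)
            cost += 1 + 8       # value bit + 8-bit run length
            i = j
        return cost

    random.seed(1)
    n = 10_000
    bits = [random.getrandbits(1) for _ in range(n)]
    print("input:", n, "bits   RLE output:", rle_cost_bits(bits), "bits")
    # ~45,000 bits out for 10,000 in: each ~2-bit run costs 9 bits to
    # name.  The clumps and patterns are there; they just don't pay.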
A string of random data may include anything. Compressibility will vary with the particular data and the algorithm used. Lucky with one input set, unlucky with another. I wouldn't try to base a business on getting lucky with the customer's input data but the absolute statement of incompressibility is wrong unless you preselect your data sets. Then they're no longer random and the definition of random becomes circular.
You are missing the big picture. About 1 / 1^(N-1) bit strings of length N will have descriptions shorter than the bit string. The rest will actually require longer descriptions, counting the necessary dictionary. As N approaches infinity, essentially a vanishingly small fraction of the bit strings have short descriptions. This makes sense for a number of reasons, once you start thinking about it.
Definition: Tim's Magic Number: 1 0 1 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0
So I present you with the number :
1 0 1 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0
and ask you to compress it. You look at it, recognize it as Tim's Magic Number, and proudly announce that it has a very concise (short) description: TMN.
Why this will in general not work, how this relates to intelligence tests (multiple good answers to "What is this pattern?" questions), and Berry's Paradox -- these are all related to issues of randomness.
I'm sorry, but I just can't write a chapter of a book on randomness, laying out all the basics. Even a tutorial on randomness has no shorter description than itself.
There are a bunch of Web pages. Keywords being: randomness, bit string complexity, algorithmic information theory, Kolmogorov, Solomonoff, Chaitin, etc. There's a recent book devoted entirely to an introduction to randomness, called "Randomness." John Casti, Rudy Rucker, and John Barrow have all written good popular accounts. I can't remember if John Pierce's seminal book "Symbols, Signals and Noise" goes into this, but I expect it does.
--Tim May
"If I'm going to reach out to the Democrats then I need a third hand. There's no way I'm letting go of my wallet or my gun while they're around." --attribution uncertain, possibly Gunner, on Usenet
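[The TMN trap can also be put in code form. This toy sketch is purely illustrative -- the names and the string-based "encoding" are hypothetical -- but it shows where the bits actually go:]

    # The name "TMN" looks like a 3-character compression, but any
    # decoder that understands it must carry the full 43-bit number
    # in its dictionary, so the total description (output + decoder)
    # gets longer, not shorter.
    TMN = "1011101100101010001001101101110111110101100"

    def compress(s):
        return "TMN" if s == TMN else s

    def decompress(s):
        return TMN if s == "TMN" else s

    assert decompress(compress(TMN)) == TMN
    print("output:", len(compress(TMN)), "chars;  dictionary:",
          len(TMN), "chars carried inside the decoder")
    # Naming one string cheaply just moves its bits into the decoder,
    # and every OTHER string still has to be spelled out in full.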
On Wednesday, January 9, 2002, at 02:39 PM, Tim May wrote:
You are missing the big picture. About 1 / 1^(N-1) bit strings of length N will have descriptions shorter than the bit string.
A typo. "1 / 2^(N-1)" is what I meant to type. If N = 100, about 1/2^99 of the bit strings have "short" (catchy, memorable, loosely speaking) descriptions. Most of the rest actually take more bits to describe if described in terms of "words" (small patterns, runs). This is what we mean when we say they are their own shortest descriptions. This is similar, by the way, to poker or bridge hands. In a fair deal, all hands are equally likely. However, because of our naming conventions and interest in "all hearts" or "a sequential run of spades from 10 to Ace," we are fooled into thinking some of these hands are "rarer" than "random hands." Again, this is just one facet of a very interesting topic. --Tim May
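[The bridge-hand point is easy to check numerically; a small sketch, illustrative only:]

    # Every SPECIFIC 13-card hand is equally likely; "all hearts"
    # only feels rarer because it has a name.
    from math import comb

    total = comb(52, 13)
    print("distinct 13-card hands:", total)        # 635,013,559,600
    print("P(the all-hearts hand)     = 1/%d" % total)
    print("P(any other specific hand) = 1/%d" % total)
    # There is exactly 1 named "all hearts" hand but billions of
    # unnamed, random-looking ones, so some unremarkable hand shows
    # up on nearly every deal while the named pattern almost never does.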
From: "Michael Motyka" <mmotyka@lsil.com>
random data cannot be compressed
or
random data may include recognizable and even repeating patterns
Both. As someone already explained, you can write a specific algorithm that will compress a string of 100 zeros to a single 0. Now, if given a random string of 100 bits, there is a probability of 1 in 2^100 that this string is all zeros - in which case you'll manage to compress it. But in all other cases, your algorithm won't accomplish anything. And in the one case where you're lucky, the string is NOT random - it's specified. Mark
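[Mark's special-case algorithm can be written out with the one-bit flag it needs to stay lossless. This toy version is my own construction, not code from the thread, and it shows the hidden cost:]

    # The all-zeros string shrinks to a single flag bit, but every
    # other string must GROW by one flag bit so the decoder can tell
    # the two cases apart.  (Strings of '0'/'1' characters.)
    import random

    ZEROS = "0" * 100

    def compress(bits):
        return "1" if bits == ZEROS else "0" + bits

    def decompress(code):
        return ZEROS if code == "1" else code[1:]

    s = "".join(random.choice("01") for _ in range(100))
    assert decompress(compress(s)) == s
    print("lucky input  :", len(compress(ZEROS)), "bit")    # 1
    print("typical input:", len(compress(s)), "bits")       # 101
    # One input in 2^100 wins 99 bits; the other 2^100 - 1 inputs
    # each lose 1 bit.  Averaged over random inputs, the algorithm loses.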
participants (4)
- georgemw@speakeasy.net
- Marcel Popescu
- Michael Motyka
- Tim May