Re: Random Data Compressed 100:1 (Guffaw)
georgemw@speakeasy.net wrote :
On 8 Jan 2002, at 9:51, Michael Motyka wrote:
Eric Cordian <emc@artifact.psychedelic.net> wrote :
Someone else needs to read the comp.compression FAQ.
http://www.reuters.com/news_article.jhtml?type=technologynews&StoryID=498720
-----
NEW YORK (Reuters) - A Florida research start-up working with a team of renowned mathematicians said on Monday it had achieved a breakthrough that overcomes the previously known limits of compression used to store and transmit data. ...

ZeoSync said its scientific team had succeeded on a small scale in compressing random information sequences in such a way as to allow the same data to be compressed more than 100 times over -- with no data loss. That would be at least an order of magnitude beyond current known algorithms for compacting data. ...

--
Eric Michael Cordian 0+
There may be a way to derive & patch together pseudorandom sequence generators, who knows. If there is anything real about this I wonder how long it takes to compress large blocks of arbitrary input data? Geological time scales anyone?
Mike
A meaningless question. As Eric correctly points out, true random input simply cannot be compressed; it doesn't matter how clever you are or how much time and computing power you have access to. Conversely, if you're just interested in "compressing" the results of your own pseudo-random number generator, it's trivial to do it instantly: just store the seed and the desired file size.
George
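George's seed-and-length point is easy to make concrete. A minimal sketch in Python, assuming the standard library's Mersenne Twister as the stand-in PRNG (purely an illustration; nothing ZeoSync has described):

    import random

    def decompress(seed, nbytes):
        # regenerate the "file" from nothing but the seed and its length
        return random.Random(seed).randbytes(nbytes)   # randbytes: Python 3.9+

    original = random.Random(12345).randbytes(1_000_000)   # 1 MB of PRNG output
    compressed = (12345, 1_000_000)                         # the entire "archive": two integers
    assert decompress(*compressed) == original

Of course this only "compresses" output you generated yourself; it says nothing about data from any other source, which is the whole point.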
So what is compression/decompression? Isn't it just finding corresponding sets using defined procedures, one set being "smaller" than the other? Which types of procedures are allowed and which are not?

What exactly is random data? Does it have to appear to be random? Does it have to pass some set of statistical tests to be random? If a string of bits from a radiation source spells "Merry Christmas and a Happy New Year" in ASCII, is it non-random - a message from above? If a string of bits from a natural source (decay etc.) matches a string of bits from some PRNG, is it somehow disqualified as truly random data?

I think the classification "random" does not rule out the presence of local patterns, however obscure. If a substring of data happens to match what is generated by some PRNG, then that substring can be "compressed" to {generator, seed, count}. The geological timescale might be involved in matching generator outputs to input data sections or deriving generators for subsections of data.

So, bottom line, my point, while not necessarily practical, was entirely missed by you even as you summarized it :-) Here we go:

"As Eric correctly points out, true random input simply cannot be compressed, it doesn't matter how clever you are or how much time and computing power you have access to."

This is a statement of belief, isn't it? Odd.

Mike
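Mike's {generator, seed, count} idea can also be sketched as a toy, again in Python and again purely illustrative: brute-force a seed whose PRNG output reproduces a given chunk. It only succeeds when the chunk really did come from a seed inside the search range; for arbitrary chunks the search fails, or the winning seed costs at least as many bits to record as the chunk it replaces, which is where the counting argument later in the thread bites.

    import random

    def find_seed(chunk, max_seed=2**16):
        # Try every seed in a small range; return one whose PRNG output
        # prefix equals the chunk, or None if nothing matches.
        for seed in range(max_seed):
            if random.Random(seed).randbytes(len(chunk)) == chunk:
                return seed            # chunk "compresses" to (seed, len(chunk))
        return None                    # no luck: store the chunk verbatim

    chunk = random.Random(54321).randbytes(64)   # a chunk that really was PRNG output
    print(find_seed(chunk))                      # 54321, after ~65k trials
    print(find_seed(bytes(64)))                  # almost certainly None for 64 zero bytes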
On Tue, Jan 08, 2002 at 02:27:47PM -0700, Michael Motyka wrote:
So what is compression/decompression? Isn't it just finding corresponding sets using defined procedures, one set being "smaller" than the other? Which types of procedures are allowed and which are not?
What exactly is random data? Does it have to appear to be random? Does it have to pass some set of statistical tests to be random? If a string
I'm naturally skeptical of this claim (until I can verify it for myself), but I do not believe the claim is "we can encode random data at 100:1." They seem to be talking about 100:1 lossless compression of real-world data, which is generally not random.

-Declan
Declan opines:
I'm naturally skeptical of this claim (until I can verify it for myself), but I do not believe the claim is "we can encode random data at 100:1."
From the article:
"ZeoSync said its scientific team had succeeded on a small scale in compressing random information sequences in such a way as to allow the same data to be compressed more than 100 times over -- with no data loss." Now of course it's possible they were horribly misquoted. Still, it is worrisome that so many people quoted in the article think such algorithmic gymnastics are mathematically possible. -- Eric Michael Cordian 0+ O:.T:.O:. Mathematical Munitions Division "Do What Thou Wilt Shall Be The Whole Of The Law"
Eric Cordian wrote:
Declan opines:
I'm naturally skeptical of this claim (until I can verify it for myself), but I do not believe the claim is "we can encode random data at 100:1."
From the article:
"ZeoSync said its scientific team had succeeded on a small scale in compressing random information sequences in such a way as to allow the same data to be compressed more than 100 times over -- with no data loss."
Now of course it's possible they were horribly misquoted. Still, it is worrisome that so many people quoted in the article think such algorithmic gymnastics are mathematically possible.
The overpowering stench of snake oil pervades the ether. I particularly liked:

"The techniques described by ZeoSync would mark a break with the dozens of existing compression technologies, including MPEG for video and music and JPEG for pictures and graphics are able to compact data at compression rates up to 10 times the original. These algorithms typically work by eliminating long strings of identifiable bits of data such as blue sky, green grass or the white background of a snow-covered landscape."

Which sounds like someone who doesn't know what they are talking about being misreported by someone who doesn't understand. (*)

"ZeoSync said its scientific team had succeeded on a small scale"

"The company's claims, which are yet to be demonstrated in any public forum"

"ZeoSync, whose Web site can be located at http://www.zeosync.com/"

"Among the scientific team working with ZeoSync is Steve Smale, one of America's most renowned mathematicians. Smale is an emeritus professor at the University of California at Berkeley and the 1966 winner of the Fields Prize, the Nobel Prize for researchers in this field. He could not be reached for comment on his role in the project."

I bet.

Ken Brown

(*) Clue for those who have too much of a life to bother with things like file structures and encoding (though why would they be reading cypherpunks?) - JPEG is a lossy compression method that tries to recode "real world" pictures in a way that *looks* almost as good as the real thing but takes up less space, by smoothing out gradual changes of colour (and other stuff as well). It doesn't "typically work by eliminating long strings of identifiable bits of data". And it doesn't compress "up to 10 times"; it compresses as much as you like, with a trade-off between file size and image quality. MPEG, AFAIK, is similar.
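To see Ken's size/quality trade-off directly rather than argue about it, here is a quick sketch using the Pillow imaging library; the input file name is hypothetical, and any real-world photograph will do.

    import os
    from PIL import Image

    img = Image.open("photo.jpg")                 # hypothetical input: any photo you have
    for quality in (95, 75, 30, 5):
        name = "out_q%d.jpg" % quality
        img.save(name, "JPEG", quality=quality)
        print(name, os.path.getsize(name), "bytes")
    # File size falls smoothly as quality falls; there is no fixed "up to 10x"
    # ceiling, only an increasingly visible loss of detail.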
I think they may be referring to a random string of _ASCII characters_. That would be subject to compression because it is not random at the bit level. But 100:1? I have no idea how to achieve that. (A rough bound on how far this can go is sketched after this message.)

Marc de Piolenc

Declan McCullagh wrote:
What exactly is random data? Does it have to appear to be random? Does it have to pass some set of statistical tests to be random? If a string
I'm naturally skeptical of this claim (until I can verify it for myself), but I do not believe the claim is "we can encode random data at 100:1." They seem to be talking about 100:1 lossless compression of real-world data, which is generally not random.
-Declan
--
Remember September 11, 2001 but don't forget July 4, 1776

Rather than make war on the American people and their liberties, ...Congress should be looking for ways to empower them to protect themselves when warranted.

They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety. - Benjamin Franklin
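Marc's observation is easy to quantify. A uniformly random printable ASCII character carries log2(95) ≈ 6.6 bits of information but occupies 8 bits, so the best possible lossless ratio is about 1.2:1, which an off-the-shelf compressor nearly reaches. A sketch, using Python's zlib:

    import math, random, string, zlib

    alphabet = string.ascii_letters + string.digits + string.punctuation + " "   # 95 symbols
    data = "".join(random.choice(alphabet) for _ in range(100_000)).encode()

    entropy_limit = 8 / math.log2(len(alphabet))          # ~1.22:1, the theoretical ceiling
    achieved = len(data) / len(zlib.compress(data, 9))
    print(round(entropy_limit, 2), round(achieved, 2))    # both close to 1.2 -- nowhere near 100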
On Tuesday, January 8, 2002, at 07:12 PM, F. Marc de Piolenc wrote:
I think they may be referring to a random string of _ASCII characters_. That would be subject to compression because it is not random at the bit level.
But 100:1? I have no idea how to achieve that.
The entropy of ordinary written English is about 2.2 bits per character. (Meaning, of course, that each new character contributes not much "surprise," i.e., is somewhat predictable. Entropies of various sources can be experimentally measured.)

Others, particularly Eric Cordian, have already written at length on this claim of "100:1 compression of any file."

Steve Smale at Berkeley is a famous and respected mathematician. He may be unaware of these claims, he may be senile, he may have worked on some small (no pun intended) aspect as a paid consultant, or the actual compression may be lossy and oriented toward replacing JPEG with a better lossy algorithm (as Iterated Systems, IIRC, has been doing with fractal compression).

The pigeonhole principle is not just a "theory," it's the _law_. You can't stuff 64 distinct things into 32 boxes without a lot of them being put into the same box. Which means reversing (decompressing) to the original won't work. The Chaitin-Kolmogorov view of algorithmic information theory is the best model we have.

By the way, for all of you who struggled to understand Godel's Theorem expressed in the "logic" formulation that Godel first used--and that was the basis of Newman's fairly popular exposition--there are very understandable formulations of Godel's Theorem in terms of information theory and compressibility. Chaitin has a very clean one (no Kleene pun intended). Rudy Rucker does a good job of explaining it in "Mind Tools." Chaitin also has some good books and articles, though some of them jump too quickly into stuff like computing Omega...his "Scientific American"-level articles on the foundations of mathematics are better places to start...included in some of his books.

--Tim May

"A democracy cannot exist as a permanent form of government. It can only exist until the voters discover that they can vote themselves money from the Public Treasury. From that moment on, the majority always votes for the candidate promising the most benefits from the Public Treasury with the result that a democracy always collapses over loose fiscal policy always followed by dictatorship." --Alexander Fraser Tyler
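Worked out against Tim's 2.2-bits-per-character figure: ASCII spends 8 bits on each character of English text, so entropy coding alone tops out near 8 / 2.2 ≈ 3.6:1 -- and that is for ordinary text, which is far from random. The claimed 100:1 on random input is well over an order of magnitude beyond even that.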
On 9 Jan 2002, at 10:40, Tim May wrote:
By the way, for all of you who struggled to understand Godel's Theorem expressed in the "logic" formulation that Godel first used--and that was the basis of Newman's fairly popular exposition--there are very understandable formulations of Godel's Theorem in terms of information theory and compressibility. Chaitin has a very clean one (no Kleene pun intended).
Maybe I'm oversimplifying, but it seems to me that Godel's theorem follows trivially once you've heard of Cantor's diagonal slash. As I understand it, Godel's theorem says in essence that in any system complex enough to include the irrational numbers there must be statements which are true in that system but which cannot be proven in that system. But since there are uncountably many irrational numbers, there are uncountably many statements of the form A=A which are true but which cannot be expressed with a finite number of symbols. George
--Tim May
George
From: <georgemw@speakeasy.net>
Maybe I'm oversimplifying, but it seems to me that Godel's theorem follows trivially once you've heard of Cantor's diagonal slash. As I understand it, Godel's theorem says in essence that in any system complex enough to include the irrational numbers there must be statements which are true in that system but which cannot be proven in that system. But since there are uncountably many irrational numbers, there are uncountably many statements of the form A=A which are true but which cannot be expressed with a finite number of symbols.
I think you're oversimplifying. The theorem doesn't say "there are statements which can't be written", but "there are statements (implicitly writeable) which can't be proven". Mark
On Wednesday, January 9, 2002, at 12:58 PM, Marcel Popescu wrote:
From: <georgemw@speakeasy.net>
Maybe I'm oversimplifying, but it seems to me that Godel's theorem follows trivially once you've heard of Cantor's diagonal slash. As I understand it, Godel's theorem says in essence that in any system complex enough to include the irrational numbers there must be statements which are true in that system but which cannot be proven in that system. But since there are uncountably many irrational numbers, there are uncountably many statements of the form A=A which are true but which cannot be expressed with a finite number of symbols.
I think you're oversimplifying. The theorem doesn't say "there are statements which can't be written", but "there are statements (implicitly writeable) which can't be proven".
What Godel spent most of his proof doing was formalizing "statements" and what it means to write them. The "hand-waving version" of Godel's proof is related to Cantor's diagonalization proof, but didn't fall out of it trivially.

--Tim May

"Dogs can't conceive of a group of cats without an alpha cat." --David Honig, on the Cypherpunks list, 2001-11
Michael Motyka wrote:
What exactly is random data? Does it have to appear to be random?
Does it have to pass some set of statistical tests to be random? If a string of bits from a radiation source spells "Merry Christmas and a Happy New Year" in ASCII, is it non-random - a message from above?
Randomness is generally defined for sequences, which are mappings from the non-negative integers onto some finite set of symbols. A sequence is random if its values are independently selected and uniformly distributed.

A random sequence of ASCII characters will not only spell out "Merry Christmas and a Happy New Year" numerous times, but will spell out every other possible piece of text as well. In a random sequence of bits, all 2^N N-bit strings are equally likely to occur as we step through the sequence, for any chosen value of N.

The Theory of Compression works with sequences. Finite sets of data, or strings, are considered to be initial segments of sequences. There is also the notion of the Kolmogorov complexity of a string, which is the length of the shortest program which generates it. This also leads to the Kolmogorov definition of a random string, as one whose shortest description is itself.

Most practical compression algorithms, however, do not work with the complexity of data in the Kolmogorov sense, but merely compress a sequence using a fixed amount of its previous history as a dictionary.

As a simple example of how a good compression algorithm might work, consider a person sitting in front of a console that displays the first N-1 characters of some text. The person tries to guess character N and makes successive guesses until the computer tells him he is successful. If we entropy code the number of guesses at each point, and we presume different persons use the same deterministic algorithm for guessing based on a knowledge of the English language, and on what has previously occurred in the text, we can transform any English text of more than trivial length into a lesser number of bits than are required to represent it in ASCII.

If we imagine such a scheme applied to the random binary sequence described earlier, the person will on the average require 1 guess 50% of the time, and 2 guesses the other 50% of the time to correctly identify the next bit. Entropy coding two symbols which occur with equal frequency requires one bit for each occurrence. In this case, attempting "compression" requires one bit output for each bit input, demonstrating that a random sequence cannot be compressed.

Talking about a finite string being "random" in the commonly accepted non-Kolmogorov sense is a meaningless notion, apart from the string being the initial segment of some random sequence. Any random string is produced by taking a finite number of instances of some random variable, and such a random variable has, by definition, an infinite sequence of such instances over which its mathematical properties are defined.

--
Eric Michael Cordian 0+ O:.T:.O:. Mathematical Munitions Division
"Do What Thou Wilt Shall Be The Whole Of The Law"
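Eric's conclusion is also easy to check empirically with whatever compressors are at hand; this only shows that these particular programs fail on random input, while the counting argument shows every lossless scheme must. A sketch in Python:

    import bz2, lzma, os, zlib

    data = os.urandom(1_000_000)        # 1 MB from the operating system's entropy source
    for name, compress in (("zlib", zlib.compress),
                           ("bz2", bz2.compress),
                           ("lzma", lzma.compress)):
        grown = len(compress(data)) - len(data)
        print(name, "+%d bytes" % grown)   # every "compressed" output comes out larger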
At 02:27 PM 1/8/2002 -0700, Michael Motyka wrote:
georgemw@speakeasy.net wrote :
On 8 Jan 2002, at 9:51, Michael Motyka wrote:
Eric Cordian <emc@artifact.psychedelic.net> wrote :
Someone else needs to read the comp.compression FAQ.
What exactly is random data? Does it have to appear to be random? Does it have to pass some set of statistical tests to be random? If a string of bits from a radiation source spells "Merry Christmas and a Happy New Year" in ASCII, is it non-random - a message from above? If a string of bits from a natural source (decay etc.) matches a string of bits from some PRNG, is it somehow disqualified as truly random data?
I think the classification random does not rule out the presence of local patterns however obscure. If a substring of data happens to match what is generated by some PRNG then that substring can be "compressed" to {generator, seed, count}. The geological timescale might be involved in matching generator outputs to input data sections or deriving generators for subsections of data.
I came up with a similar approach in the late '80s (I may have even discussed it on the list a year or so back). I was interested in compressing image data, which has quite a bit of inter-pixel correlation. My approach was to run a Hilbert space-filling curve through the image (it looks like a tightly wound maze). This allows one to maintain most of the pixel correlation while transforming from an array to a line. Then I analyzed the auto correlation of the runs of various lengths and attempted to create generators which could produce the required combinations/permutations and auto correlations to code for the runs. I say attempted, because I was never able to find acceptable algorithms to satisfy my requirement. I still believe these algorithms exist; it was just my limitations in identifying the underlying math needed.

steve
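Not Steve's code, which he says he never finished, but the linearization step he describes is standard and easy to sketch: the Hilbert curve's index-to-coordinate map, which walks an n x n image (n a power of two) so that consecutive indices are always adjacent pixels, preserving most of the inter-pixel correlation when the 2-D image becomes a 1-D run.

    def d2xy(n, d):
        # Map Hilbert-curve index d (0 .. n*n-1) to pixel (x, y) on an n x n grid,
        # n a power of two.  Standard bit-twiddling construction of the curve.
        x = y = 0
        s, t = 1, d
        while s < n:
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:                      # rotate this quadrant into place
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x += s * rx
            y += s * ry
            t //= 4
            s *= 2
        return x, y

    n = 8                                    # an 8x8 toy "image"
    path = [d2xy(n, d) for d in range(n * n)]
    # every step moves exactly one pixel, so neighbours in the run were neighbours in the image
    assert all(abs(x1 - x2) + abs(y1 - y2) == 1
               for (x1, y1), (x2, y2) in zip(path, path[1:]))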
On Tue, 8 Jan 2002, Steve Schear wrote:
combinations/permutations and auto correlations to code for the runs. I say attempted, because I was never able to find acceptable algorithms to satisfy my requirement. I still believe these algorithms exist, it was just my limitations in identifying the underlying math needed.
http://www.google.com/search?q=IFS+image+compression&sourceid=opera&num=100&ie=utf-8&oe=utf-8
On 8 Jan 2002, at 14:27, Michael Motyka wrote:
Here we go :
"As Eric correctly points out, true random input simply cannot be compressed, it doesn't matter how clever you are or how much time and computing power you have access to."
This is a statement of belief isn't it? Odd.
Mike
No, it's a statement of mathematical fact, easily understandable by anyone of average intelligence with a little thought. The pigeonhole principle states that if there are N pigeonholes and M > N objects that have to be stuffed into the pigeonholes, at least one of the pigeonholes will have to contain more than one object.

Any compression algorithm is a mapping between files, and a lossless one has to be one-to-one. Therefore, there CANNOT exist an algorithm which will compress all files, because there are always twice as many bit patterns N + 1 bits long as there are bit patterns N bits long. Clear enough?

If not, consider this (which Eric also mentioned in his original post): if you really had an algorithm that could compress ANY file to 1/100 its size, you could run the output data through the algorithm again and get something 1/10000 the size of the original file, and keep going until almost nothing is left -- which is absurd.

Now, the philosophically inclined might point out that, although there cannot exist an algorithm which can compress all files, and any given "compression" algorithm is highly unlikely to be able to compress a truly random file, nonetheless there are no specific files of which you can say, "this file is uncompressible", since you could always devise an algorithm specifically designed to compress that particular file (for example, by mapping it to a file of zero length and mapping all other files to themselves with an extra 0 on the end). To which the practical man replies, "you're in a balloon about 20 feet off the ground".

George
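For anyone who wants George's pigeonhole count spelled out, a few lines of arithmetic (pure counting, no claim about any particular algorithm):

    L = 800                                       # bits in a 100-byte file
    inputs = 2 ** L                               # distinct possible L-bit files
    shorter = 2 ** L - 1                          # distinct files of fewer than L bits (lengths 0..L-1)
    hundredth = 2 ** (L // 100 + 1) - 1           # distinct files of at most L/100 bits
    print(inputs > shorter)                       # True: no lossless map can shrink *every* L-bit file
    print(inputs // hundredth)                    # ~2**791 inputs per available 100:1-sized output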
Michael Motyka wrote:
Here we go :
"As Eric correctly points out, true random input simply cannot be compressed, it doesn't matter how clever you are or how much time and computing power you have access to."
This is a statement of belief isn't it? Odd.
No, it's a logical consequence of his definition of "random" & no more a statement of belief than "1+1=2" is. If you use the word to mean something else then it might or might not be true in your universe. Ken
On Wed, 9 Jan 2002, Ken Brown wrote:
Michael Motyka wrote:
Here we go :
"As Eric correctly points out, true random input simply cannot be compressed, it doesn't matter how clever you are or how much time and computing power you have access to."
This is a statement of belief isn't it? Odd.
No, it's a logical consequence of his definition of "random" & no more a statement of belief than "1+1=2" is. If you use the word to mean something else then it might or might not be true in your universe.
So, it depends on what the definition of "is" is? ;-)

--
Yours,
J.A. Terranson
sysadmin@mfn.org

If Governments really want us to behave like civilized human beings, they should give serious consideration towards setting a better example: Ruling by force, rather than consensus; the unrestrained application of unjust laws (which the victim-populations were never allowed input on in the first place); the State policy of justice only for the rich and elected; the intentional abuse and occasional destruction of entire populations merely to distract an already apathetic and numb electorate... This type of demagoguery must surely wipe out the fascist United States as surely as it wiped out the fascist Union of Soviet Socialist Republics.

The views expressed here are mine, and NOT those of my employers, associates, or others. Besides, if it *were* the opinion of all of those people, I doubt there would be a problem to bitch about in the first place...
--------------------------------------------------------------------
participants (11)

- Declan McCullagh
- Eric Cordian
- Eugene Leitl
- F. Marc de Piolenc
- georgemw@speakeasy.net
- Ken Brown
- Marcel Popescu
- measl@mfn.org
- Michael Motyka
- Steve Schear
- Tim May