Jim Gillogly <jim@rand.org> writes:
Hal discusses using the Distinguished Points method to find hash collisions presented by Michael Wiener with Paul van Oorschot at Rump Crypto '94, and lists two benefits:
(1) saves space in searching for loops on a single processor; (2) allows parallel searches for collisions over multiple processors.
I claim it's useful only for (2), because another algorithm dominates it for single processor loop detection... at least in storage space. ["rho" method elided]
Yes, this is a good point, the main advantage of the DP algorithm is that it parallelizes. Rho does have the problem that you have to run 3 MD5's for each step, but OTOH it does not have the overhead of saving and checking the distinguished points, so which one would be best on a single processor would depend on the relative costs.
Do you (Hal?) or anybody else know whether Wiener and van Oorschot were taking into account the contraction of the range each time you iterate MD5? I think the size of the set of all numbers that are the result of MD5ing a 128-bit number is considerably smaller than 2^128... is it 1/e of that? Anybody know about random mappings?
They didn't mention anything about this, and I would think they would have if they had considered it. My intuition was that x=MD5(x) would cover a large fraction of the 128 bit output space, but on further thought Jim appears to be right: with n input values into a random function (n would be 2^128 in this case), the chance of a particular output being missed for any one input would be 1-1/n, and the chance of it being missed for all n inputs would be (1-1/n)^n. Taking the limit as n approaches infinity gives 1/e as the fraction of values which would be missed. This means that the fraction of hits would be 1 - 1/e, much lower than I had guessed.
Subsequent iterations reduce it further, though of course not by 1/e each time, so that the set of numbers that are the result of iteratively MD5ing a number N times should be an appreciably smaller set to be groping around in.
The way I figure it, if the fraction of the original n is f (which would be 1 before the first iteration, and 1 - 1/e before the 2nd iteration based on the above), the chance of a point being missed is (1-1/n)^(nf), which is 1/e^f. So f would be found by f = 1 - 1/e^f, iterating once per MD5 iteration and starting f at 1. I just did an experiment of iterating this. After 100 times f was about .02; after 1000 times f was about .002, suggesting f = 2/iterations. If this is right, you might be able to get a birthday match after only the cube root of n tries rather than the square root of n, or about 2^44 iterations or so rather than 2^64, because at that point you are only looking at 2^85 possible output values. This result is only really valid for serial machines; parallel ones search more per iteration so this would move you back towards the 2^64 number. It does imply that you don't really get k-fold speedup with k machines if you take this effect into consideration.
Jim Gillogly 18 Halimath S.R. 1994, 16:12
Gee, my calendar must be off! Hal