Re: Cracking MD5 for $10M

17 Dec 2003

      Jim Gillogly <jim@rand.org> writes:
...
Hal discusses using the Distinguished Points method to find hash
collisions presented by Michael Wiener with Paul van Oorschot at Rump
Crypto '94, and lists two benefits:
...
(1) saves space in searching for loops on a single processor;
(2) allows parallel searches for collisions over multiple processors.
...
I claim it's useful only for (2), because another algorithm dominates it
for single processor loop detection... at least in storage space.
["rho" method elided]
Yes, this is a good point, the main advantage of the DP algorithm is
that it parallelizes.  Rho does have the problem that you have to run
3 MD5's for each step, but OTOH it does not have the overhead of saving
and checking the distinguished points, so which one would be best on a 
single processor would depend on the relative costs.
...
Do you (Hal?) or anybody else know whether Wiener and van Oorschot were
taking into account the contraction of the range each time you iterate
MD5?  I think the size of the set of all numbers that are the result of
MD5ing a 128-bit number is considerably smaller than 2^128... is it 1/e of
that?  Anybody know about random mappings?
They didn't mention anything about this, and I would think they would have
if they had considered it.  My intuition was that x=MD5(x) would cover a
large fraction of the 128 bit output space, but on further thought Jim
appears to be right: with n input values into a random function (n would
be 2^128 in this case), the chance of a particular output being missed for 
any one input would be 1-1/n, and the chance of it being missed for all
n inputs would be (1-1/n)^n.  Taking the limit as n approaches infinity
gives 1/e as the fraction of values which would be missed.  This means
that the fraction of hits would be 1 - 1/e, much lower than I had
guessed.
...
Subsequent iterations reduce
it further, though of course not by 1/e each time, so that the set of
numbers that are the result of iteratively MD5ing a number N times should
be an appreciably smaller set to be groping around in.
The way I figure it, if the fraction of the original n is f (which would be
1 before the first iteration, and 1 - 1/e before the 2nd iteration based on
the above), the chance of a point being missed is (1-1/n)^(nf), which is
1/e^f.  So f would be found by f = 1 - 1/e^f, iterating once per MD5
iteration and starting f at 1.  I just did an experiment of iterating this.
After 100 times f was about .02; after 1000 times f was about .002,
suggesting f = 2/iterations.  If this is right, you might be able to get
a birthday match after only the cube root of n tries rather than the
square root of n, or about 2^44 iterations or so rather than 2^64, because
at that point you are only looking at 2^85 possible output values.

This result is only really valid for serial machines; parallel ones
search more per iteration so this would move you back towards the 2^64
number.  It does imply that you don't really get k-fold speedup with k
machines if you take this effect into consideration.
...
Jim Gillogly
  18 Halimath S.R. 1994, 16:12
Gee, my calendar must be off!

Hal