I mentioned a few days ago that one of the "rump session" papers at the crypto conference claimed that a machine could be built which would find MD5 collisions for $10M in about 20 days. I wanted to write a little more detail about how this attack could work. It is similar to a "meet in the middle" (MITM) attack which Norm Hardy suggested here in July when we were discussing double DES:
There may be more than one way that MITM (meet in the middle) may be used to attack Double block cyphers. I assume the following attack. You know some block of plain-text P and corresponding cypher text C. You believe that C = E(k, E(j, P)) where E(k, p) is the encypherment of p with key k. D(k, E(k, p)) = p. You need to find keys k and j. Classic MITM is to produce a file A with records: <k, E(k, P)> for each k, and file B with records <j, D(j, C)> for each j. Sort both A and B on the second field. Pass over the sorted files looking for a record from file A whose second field is the same as a record in file B.
To substantially shorten the ammount of tape used by a factor 2^n at the expense of evaluating C and D 2^n more often do the following:
For m from 0 to 2^n-1 Do Produce file A with records: <k, E(k, P)> for each k where (the right n bits of E(k, P)) = m. (discarding other records) Produce file B with records <j, D(j, C)> for each j where (the right n bits of D(j, C)) = m Sort files A and B on second field. Pass over files looking for records from A that match records from b in the second field. Enddo.
The idea of saving only outputs where certain bits are constant is the key to the "distinguished points" method which is used to save space with only a modest cost in time. The other key idea is that instead of evaluating MD5(n) where n iterates on its own, you look for cycles in the recurrence x = MD5(x). Any cycle which is found which does not include the x you start with will lead to a case where two values hash to the same MD5 value. For a trivial example, suppose the output of a formula like this consists of the values 1,4,5,2,7,8,5,2,7,8,5,2,7,8,.... Here we have a four element cycle which leads to two different predecessors for the value 5. The brute-force way to solve this would be to save all outputs from the formula, and with each new value to compare it with all earlier values. With MD5, which has a presumably random structure and 128 bits of output, the birthday paradox suggests that you would have to create and save about 2^64 output values before finding a match. Creating 2^64 values might be possible today for the time and dollar values we are talking about, but storing them appears to be out of the question, as our earlier discussion of double DES (and other discussions of MITM here) have made clear. The distinguished points method reduces the space requirements by only saving a fraction of the output values. For example, in the list above, we might only save multiples of 4. This would lead to 4,8,8... and it is easy to discover the match without nearly as much storage. Note, though, that 8 is not actually the value which has two predecessors, but that once this match is discovered, you can go back to the previous points (4 and 8 in this case) and run them forward more carefully, looking for a match. The other real advantage of the distinguished points method is that it parallelizes very nicely. Several machines can run x=MD5(x) with different starting values, saving all of the distinguished outputs, and we can look for matches between machines as well as in one machine. Again, a match implies two different predecessors for the same value, which is an MD5 collision. With the size of MD5, suppose we generate 2^64 outputs but only save those for which the low-order 32 bits are 0 as our distinguished points. Only 1/2^32 of values will match, so we will end up with about 2^32 outputs, probably a manageable amount. Chances are there will be a match among that set. We then go back to the previous distinguished points before the match and work forward carefully to look for the exact pair of values which lead to the same successor. Distinguished points will be about 2^32 apart so this step is easy and quick. If you want to speed it up still more you can do a recursive distinguished points pass for this step using maybe d.p.'s with the low-order 16-bits of 0 and do it in two steps that will both be very short. The net result is that we have taken virtually no more time (the 2^64 creations of MD5 will dominate) and virtually no space (compared to 2^64 stored values) and we get the effect of a birthday attack. This is another cautionary data point about the risks of relying on space costs for security rather than time costs. Hal