The numbers you give are a bit off. Assuming a 32-bit machine, n=64 implies a 2048-bit modulus, and a 4096-bit number to be reduced. Also, Karatsuba should use 1/3 (2*64^1.58 / 64^2) the multiplications rather than 1/5.
The n=64 implies two 2048-bit numbers are being multiplied. The 2048-bit number comes from the fact that in a typical crypto app, modexp will be reducing numbers as large as the modulus squared which runs 2048-bits for a 1024-bit modulus. The reciprocal is 1 block bigger than the number to be reduced. Hence, you are dealing with multiplying about two 2048-bit numbers. But since we only care about the "fractional" part of the result, we can safely throw away half the computation and only compute half the Karatsuba recursion tree. (the number before the decimal point is the quotient) Then, to determine the final remainder, we simply multiply by the modulus again, throwing away non-significant computation again. There is a normal n^2 method for reducing via reciprocal that only uses 1/4 the number of ops as the obvious technique. Your right about the 1/3 vs 1/5, I dunno where the 5 came from, must have been a typo in my calcs. The problem with Karatsuba is that it's hard to implement efficiently. Temporary ints should be kept to a minimum and be preallocated. The combine step requires 1 store, and 5 additions, of multiprecision integers. The split step requires no copying if you use pointer manipulation, and instead of shifting, don't add in place, but add "with shift" to the destination. Most of the implementations I've seen do too much copying and shifting. Given that some modern processors have efficient hardware multiply, it might not be worth all the trouble to trade mults for adds. If a processor has an efficient hardware FFT, it might even be worthwhile to use the FFT multiply method. Do you have a ref for the Montgomery method? I'm unfamilar with the name, I wonder if it's something I've seen before under a different label. Check out Schonhage's book "Fast Algorithms" They've implemented all the asymtotic algorithms efficiently and gathered performance data. I corresponded with Schonhage's grad student and he told me that Karatsuba wins for n>=8, which I find difficult to see, when it takes about n=32 for my own implementation (not optimized) to break even. -Ray