Anyway, I played around with the algorithm a little, and it's neat and easy to implement, but the speed increase is not worth the patent hassle (assuming there is a speed increase, I saw none)
The algorithm is still basically O(n^2) if used in a modexp routine. It requires n^2 multiplications and additions. Whereas, a typical Karatsuba multiplication using a high precision reciprocal will only use 2*n^1.5 multiplications and 5*n^1.5/8 additions. (for n=64 which is a 2048-bit number being reduced, it's about 1/5 the multiplications, but 5 times the additions)
I agree with you that the patent hassle is probably not worth the speed increase. If I came up with the algorithm by myself and on my own time, I certainly would not have filed a patent for it, but that wasn't the case. I also agree that the patent system should be abolished, but there is nothing I can do about that either. The speed increase does exist over Montgomery's modular reduction because it uses n*n multiplications and 1 division compared to n*(n+1) multiplications, and the pre- and post-calculations are much simpler. Division using Karatsuba multiplication does seem to have a better asymptote, but is probably slower for most practical lengths. Both Lenstra's LIP and Lacy's CryptLib use Montgomery for modular reduction. The numbers you give are a bit off. Assuming a 32-bit machine, n=64 implies a 2048-bit modulus, and a 4096-bit number to be reduced. Also, Karatsuba should use 1/3 (2*64^1.58 / 64^2) the multiplications rather than 1/5. Wei Dai