fast modular reduction
During the Crypto '95 rump session, Josh Benaloh of Microsoft Corp. presented a new modular reduction algorithm that he and I developed. It is faster than the Montgomery method by about 10 to 15%, and is more general and easier to understand. The central idea is that it is easy to reduce a number to an equivalent one that is just one "block" (machine word) longer than the modulus, by repeatedly subtracting off the highest block and adding back something that is equivalent but smaller.

In the following pseudocode, B is the radix in which the numbers are represented (2^32 for a 32-bit machine), n is the length of the modulus in blocks, U is B^(n+1) mod the modulus, X is the number to be reduced, k+1 is the length of X, and Y is the result.

1. Y = X
2. For i from k down to n+1, repeat steps 3 and 4
3.   Y = Y - Y[i] * B^i + Y[i] * U * B^(i-n-1)
4.   If Y >= B^i, then Y = Y - B^i + U * B^(i-n-1)

Tricks can be used to eliminate step 4, and to reduce Y to n blocks using one single-precision division and n more single-precision multiplications. The algorithm will hopefully be written up more completely soon.

Wei Dai
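The steps above can be modeled directly with Python big integers. This is only a minimal sketch of the pseudocode, not the patented implementation; the helper names (`num_blocks`, `reduce_mod`) are mine, and big-integer arithmetic stands in for the per-block operations a real implementation would use:

```python
def num_blocks(x, B):
    """Number of radix-B blocks needed to represent x."""
    n = 0
    while x > 0:
        x //= B
        n += 1
    return n

def reduce_mod(X, m, B=2**32):
    """Reduce X to a value equivalent mod m that fits in n+1 blocks."""
    n = num_blocks(m, B)          # length of the modulus in blocks
    U = pow(B, n + 1, m)          # U = B^(n+1) mod m, precomputed once
    Y = X
    # i runs from k (index of X's top block) down to n+1
    for i in range(num_blocks(X, B) - 1, n, -1):
        Yi = (Y // B**i) % B                            # block Y[i]
        Y = Y - Yi * B**i + Yi * U * B**(i - n - 1)     # step 3
        if Y >= B**i:                                   # step 4: absorb carry
            Y = Y - B**i + U * B**(i - n - 1)
    return Y    # Y is equivalent to X mod m, and Y < B^(n+1)
```

Each pass clears the current top block: since B^i = B^(n+1) * B^(i-n-1) is congruent to U * B^(i-n-1) mod m, the substitution in step 3 preserves the residue while shortening Y by one block. The sketch stops at n+1 blocks, as the pseudocode does; the final pass down to n blocks is the part the post says needs one single-precision division.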
> In the following pseudocode, B is the radix in which the numbers are represented (2^32 for a 32-bit machine), n is the length of the modulus in blocks, U is B^(n+1) mod the modulus, X is the number to be reduced, k+1 is the length of X, and Y is the result.
>
> 1. Y = X
> 2. For i from k down to n+1, repeat steps 3 and 4
> 3.   Y = Y - Y[i] * B^i + Y[i] * U * B^(i-n-1)
> 4.   If Y >= B^i, then Y = Y - B^i + U * B^(i-n-1)
Is there a proof of correctness available for this algorithm? It looks almost like a radix-B peasant division algorithm with some modifications. Is there an algorithmic analysis available?

I also think there is a bug in your description. Let k+1 = n+1 (i.e. the dividend is one "block" longer than the modulus). Then i=n starting out, and we have

3.   Y = Y - Y[n] * B^n + Y[n] * U * B^(n-n-1)   [we have B^-1]

I'm assuming this was unintended.

How does this algorithm compare to computing the reciprocal via Newton's formula, and then multiplying by the reciprocal using Karatsuba multiplication? While I was at IBM Watson I invented a modular reduction algorithm that saves 1/4 of the number of multiplications required on average once you have the reciprocal computed.

-Ray
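The reciprocal approach mentioned here can be sketched generically. The following is a Barrett-style reduction by a precomputed fixed-point reciprocal, written as a Python model; it is not Ray's IBM Watson algorithm (whose details are not given in the thread), just the standard shape such a method takes:

```python
def barrett_reduce(X, m, B=2**32):
    """Compute X mod m (for X < B^(2n)) via a precomputed reciprocal of m."""
    # n = length of the modulus in blocks
    n, t = 0, m
    while t > 0:
        t //= B
        n += 1
    mu = B ** (2 * n) // m          # precomputed reciprocal (about n+1 blocks)
    q = (X * mu) // B ** (2 * n)    # quotient estimate, low by at most 2
    r = X - q * m                   # candidate remainder, r < 3m
    while r >= m:                   # at most a couple of corrections
        r -= m
    return r
```

The reciprocal `mu` is computed once per modulus (by a division here, or by Newton iteration in a serious implementation), after which each reduction costs two large multiplications and a few subtractions.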
I wrote:

> modifications. Is there an algorithmic analysis available? I also think there is a bug in your description. Let k+1 = n+1 (i.e. the dividend is one "block" longer than the modulus). Then i=n starting out, and we have
Upon a closer look, I see there's no mistake. The algorithm will never reach i=n because the loop stops at n+1.

Anyway, I played around with the algorithm a little, and it's neat and easy to implement, but the speed increase is not worth the patent hassle (assuming there is a speed increase; I saw none).

The algorithm is still basically O(n^2) if used in a modexp routine. It requires n^2 multiplications and additions, whereas a typical Karatsuba multiplication using a high-precision reciprocal will only use 2*n^1.5 multiplications and 5*n^1.5/8 additions. (For n=64, which is a 2048-bit number being reduced, that's about 1/5 the multiplications, but 5 times the additions.)

Two other possible algorithms are:

Let P(x) = sum(i=0 to n-1) a_i x^i be a multiprecision integer in radix x. If m is a modulus of length n/2, rewrite P(x) as

  sum(i=0 to n/2-1) a_i x^i + x^(n/2) * sum(i=0 to n/2-1) a_{n/2+i} x^i

breaking the summation into two parts. Focus on the second term. (Each piece is the same length as, or one digit longer than, the modulus.) Perform modular reduction of the second polynomial using Horner's method:

  x*(x*(x*...*(x*a_i + a_{i-1} mod m) mod m) mod m)

Those internal mod m's can be done quickly with a 2-digit trial quotient estimation. It's still O(n^2), but might be quicker.

Still another technique: rewrite P(x) as

  (a_0 + a_2 x^2 + a_4 x^4 + ...) + x * (a_1 + a_3 x^2 + a_5 x^4 + ...)

broken into two polynomials with even and odd terms. Factor x^2 out of each piece and write

  a_0 + x^2 * (a_2 + a_4 x^2 + a_6 x^4 + ...) + x * (a_1 + x^2 * (a_3 + a_5 x^2 + a_7 x^4 + ...))

Now keep applying the recursive rule until the polynomial pieces are the same length as, or smaller than, the modulus. Then start evaluating from the inner layers: multiply each piece by x^2 (two shifts) and take the mod, then sum the results, shifting one side by 1 (for the x factor). Shifts are free because an array representation yields a shift with a pointer movement. It looks a little like the method for evaluating FFTs, but it's not.
Just something off the top of my head. (I hereby place it in the public domain, assuming it's worth anything; no patents please.) I think with a clever implementation you can trade some mults for more adds, but still use fewer additions than Russian peasant multiplication.

-Ray
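Both ideas above can be modeled with Python big integers. The sketch below is my own hypothetical rendering, not code from the thread: it evaluates a block polynomial mod m using Horner's rule for short pieces and the even/odd split for longer ones, reducing x^2 mod m at each level so every recursive call works with a small "radix":

```python
def eval_mod(blocks, x, m):
    """Evaluate sum(blocks[i] * x**i) mod m."""
    if len(blocks) <= 4:
        # Horner's method for short pieces:
        #   x*(x*(...*(x*a_d + a_{d-1}) mod m ...) mod m) mod m
        r = 0
        for a in reversed(blocks):
            r = (r * x + a) % m
        return r
    # even/odd split: P(x) = E(x^2) + x * O(x^2)
    x2 = (x * x) % m
    return (eval_mod(blocks[0::2], x2, m)
            + x * eval_mod(blocks[1::2], x2, m)) % m
```

In a multiprecision implementation the `%` operations would be the 2-digit trial-quotient reductions described above, and the splits would be pointer manipulation rather than list slicing; the threshold of 4 blocks is an arbitrary choice for the sketch.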
> Anyway, I played around with the algorithm a little, and it's neat and easy to implement, but the speed increase is not worth the patent hassle (assuming there is a speed increase, I saw none)
>
> The algorithm is still basically O(n^2) if used in a modexp routine. It requires n^2 multiplications and additions. Whereas, a typical Karatsuba multiplication using a high precision reciprocal will only use 2*n^1.5 multiplications and 5*n^1.5/8 additions. (for n=64 which is a 2048-bit number being reduced, it's about 1/5 the multiplications, but 5 times the additions)
I agree with you that the patent hassle is probably not worth the speed increase. If I had come up with the algorithm by myself and on my own time, I certainly would not have filed a patent for it, but that wasn't the case. I also agree that the patent system should be abolished, but there is nothing I can do about that either.

The speed increase over Montgomery's modular reduction does exist, because this algorithm uses n*n multiplications and 1 division, compared to Montgomery's n*(n+1) multiplications, and the pre- and post-calculations are much simpler. Division using Karatsuba multiplication does seem to have a better asymptote, but is probably slower for most practical lengths. Both Lenstra's LIP and Lacy's CryptoLib use Montgomery for modular reduction.

The numbers you give are a bit off. Assuming a 32-bit machine, n=64 implies a 2048-bit modulus and a 4096-bit number to be reduced. Also, Karatsuba should use 1/3 (2*64^1.58 / 64^2) the multiplications rather than 1/5.

Wei Dai
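For reference, the Montgomery method being compared against is from P. Montgomery, "Modular Multiplication Without Trial Division", Mathematics of Computation 44 (1985). Its REDC step computes T * R^-1 mod m for R = B^n without any division by m; a minimal Python sketch, assuming an odd modulus, looks like this:

```python
def montgomery_redc(T, m, B=2**32):
    """Compute T * R^-1 mod m, where R = B^n and n = block length of m.

    Requires gcd(m, B) == 1 (m odd when B is a power of two) and T < m * R.
    """
    n, t = 0, m
    while t > 0:
        t //= B
        n += 1
    m_neg_inv = (-pow(m, -1, B)) % B      # -m^-1 mod B, precomputed once
    for _ in range(n):                    # clear one low block per pass
        u = ((T % B) * m_neg_inv) % B
        T = (T + u * m) // B              # exact: low block of T + u*m is 0
    if T >= m:
        T -= m
    return T
```

A modexp routine keeps its operands in Montgomery form (x * R mod m), so that each modular multiplication costs one full multiply plus one REDC, and only converts back at the end.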
> The numbers you give are a bit off. Assuming a 32-bit machine, n=64 implies a 2048-bit modulus, and a 4096-bit number to be reduced. Also, Karatsuba should use 1/3 (2*64^1.58 / 64^2) the multiplications rather than 1/5.
The n=64 implies two 2048-bit numbers are being multiplied. The 2048-bit figure comes from the fact that in a typical crypto app, modexp will be reducing numbers as large as the modulus squared, which runs 2048 bits for a 1024-bit modulus. The reciprocal is 1 block bigger than the number to be reduced. Hence, you are dealing with multiplying roughly two 2048-bit numbers. But since we only care about the "fractional" part of the result, we can safely throw away half the computation and compute only half the Karatsuba recursion tree. (The number before the decimal point is the quotient.) Then, to determine the final remainder, we simply multiply by the modulus again, again throwing away the non-significant computation. There is a normal n^2 method for reducing via reciprocal that only uses 1/4 the number of ops of the obvious technique.

You're right about the 1/3 vs 1/5; I don't know where the 5 came from, it must have been a typo in my calcs.

The problem with Karatsuba is that it's hard to implement efficiently. Temporary ints should be kept to a minimum and be preallocated. The combine step requires 1 store and 5 additions of multiprecision integers. The split step requires no copying if you use pointer manipulation; and instead of shifting, don't add in place, but add "with shift" to the destination. Most of the implementations I've seen do too much copying and shifting. Given that some modern processors have efficient hardware multiply, it might not be worth all the trouble to trade mults for adds. If a processor has an efficient hardware FFT, it might even be worthwhile to use the FFT multiply method.

Do you have a ref for the Montgomery method? I'm unfamiliar with the name; I wonder if it's something I've seen before under a different label. Check out Schonhage's book "Fast Algorithms": they've implemented all the asymptotic algorithms efficiently and gathered performance data.
I corresponded with Schonhage's grad student, and he told me that Karatsuba wins for n >= 8, which I find difficult to see when it takes about n=32 for my own implementation (not optimized) to break even.

-Ray
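The Karatsuba scheme under discussion is easy to state on Python big integers. This is a toy sketch with an arbitrary 64-bit base case; a serious multiprecision implementation works on block arrays with the pointer and add-with-shift tricks described above:

```python
def karatsuba(a, b):
    """Multiply non-negative ints using 3 half-size multiplies per level."""
    if a < 2**64 or b < 2**64:
        return a * b                        # base case: hardware-sized multiply
    h = max(a.bit_length(), b.bit_length()) // 2
    a1, a0 = a >> h, a & ((1 << h) - 1)     # split a = a1*2^h + a0
    b1, b0 = b >> h, b & ((1 << h) - 1)
    hi = karatsuba(a1, b1)
    lo = karatsuba(a0, b0)
    # (a0+a1)*(b0+b1) - hi - lo == a1*b0 + a0*b1, saving one multiply
    mid = karatsuba(a0 + a1, b0 + b1) - hi - lo
    return (hi << (2 * h)) + (mid << h) + lo
```

The combine step here really is one store plus a handful of multiprecision additions; the crossover point depends entirely on how cheaply those additions and splits are implemented, which is why reported break-even sizes vary so much.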
participants (2):
- Ray Cromwell
- Wei Dai