Re: HAVAL (was Re: crypto benchmarks)

Thanks. It looks like F4 and F5 are improved. Do you know how these optimizations can be done in general? I tried treating F2 as a multivariate polynomial with coefficients in GF(2) in Mathematica. This seems to work, and I found several equivalent expressions that take 13 operations (the original also takes 13 operations). Is there a tool that can do this automatically?
I did the optimizations by hand, using simple rules of Boolean arithmetic and logic (you know, things like De Morgan's laws applied to bitwise operations). Other processor-related optimizations can also be done by hand, such as add x,x instead of shl x,1. I think I had the same problems with F2 as well; I couldn't find a way to optimize it reasonably.
The biggest problem I have with HAVAL now is that with 4 or 5 passes the transform functions are larger than 10k even with compiler optimization for size. Since the Pentium L1 instruction cache is only 8k, this makes HAVAL with 4 or 5 passes extremely slow. Do you have ideas how I can fit the transform functions into L1 cache?
You might do some creative optimization to use more registers than it does. I haven't looked at it in a while. The code was so huge and slow compared to optimized MD5 and SHS that I gave up using it for an unfinished encrypted file system.

Rob.
---
"Mutant" Rob <wlkngowl@unix.asb.com>
Send a blank message with the subject "send pgp-key" (not in quotes) for a copy of my PGP key.

Wei: I didn't see your original post, but did see Deranged's response. I would be interested to see whatever you come up with. On Sat, 20 Jan 1996 16:57:07 +0000, Deranged Mutant <WlkngOwl@UNiX.asb.com> wrote:
The biggest problem I have with HAVAL now is that with 4 or 5 passes the transform functions are larger than 10k even with compiler optimization for size. Since the Pentium L1 instruction cache is only 8k, this makes HAVAL with 4 or 5 passes extremely slow. Do you have ideas how I can fit the transform functions into L1 cache?
You might do some creative optimization to use more registers than it does. I haven't looked at it in a while. The code was so huge and slow compared to optimized MD5 and SHS that I gave up using it for an unfinished encrypted file system.
The reference implementation is TERRIBLE for small caches. You can shrink it significantly, however, by simply looping 4x across code that does the basic round operation for each of the 8 rotations -- something like:

    for (i = 4; i--; ) {
        FF_1(t7, t6, ...);
        FF_1(t6, t5, ...);
        FF_1(t5, t4, ...);
        FF_1(t4, t3, ...);
        FF_1(t3, t2, ...);
        FF_1(t2, t1, ...);
        FF_1(t1, t0, ...);
        FF_1(t0, t7, ...);
    }

The basic macro for this is almost unchanged from the reference implementation.

You can shrink it even further by, instead of coding the basic macro 8 times per round, writing a round step that works on a window of 9 words (out of an array of 16), using 8 words as input and producing the ninth as output. You then have a two-level loop that invokes this 4x8 times, walking your working set down the array 1 element each time, and every 8 passes moving the 8 current variables back where they belong. The first pass through the loop uses elements 15..8 as input and produces element 7; the second pass uses elements 14..7 as input and produces element 6, and so on. After 8 passes, you move elements 7..0 back up to 15..8 and start the inner loop over.

Alternatively, you can begin with an array of 40 words (only 8 of which contain data), use a single loop that invokes the basic processing 32 times, walk your working set 1 word each time, and only move the working set back where it belongs at the end of the full round.

On Sat, 20 Jan 1996, John Lull wrote:
I didn't see your original post, but did see Deranged's response. I would be interested to see whatever you come up with.
I ended up doing it like this:

    for (i = 0; i < 4; i++) {
        FF_42(t7, t6, t5, t4, t3, t2, t1, t0, w[wi2[8*i+0]], mc2[8*i+0]);
        FF_42(t6, t5, t4, t3, t2, t1, t0, t7, w[wi2[8*i+1]], mc2[8*i+1]);
        FF_42(t5, t4, t3, t2, t1, t0, t7, t6, w[wi2[8*i+2]], mc2[8*i+2]);
        FF_42(t4, t3, t2, t1, t0, t7, t6, t5, w[wi2[8*i+3]], mc2[8*i+3]);
        FF_42(t3, t2, t1, t0, t7, t6, t5, t4, w[wi2[8*i+4]], mc2[8*i+4]);
        FF_42(t2, t1, t0, t7, t6, t5, t4, t3, w[wi2[8*i+5]], mc2[8*i+5]);
        FF_42(t1, t0, t7, t6, t5, t4, t3, t2, w[wi2[8*i+6]], mc2[8*i+6]);
        FF_42(t0, t7, t6, t5, t4, t3, t2, t1, w[wi2[8*i+7]], mc2[8*i+7]);
    }

This allows all the transform functions to fit into L1 cache, but at a cost: besides the overhead of the for loop, each macro call now does two extra table lookups (in wi2 and mc2). The net result is a ~100% speedup over the reference implementation.

Also, FYI, the boolean functions used in the reference implementation can be optimized. Thanks to Deranged Mutant for these:

    /* #define f_2(x6, x5, x4, x3, x2, x1, x0) \
           ((x2) & ((x1) & ~(x3) ^ (x4) & (x5) ^ (x6) ^ (x0)) ^ \
            (x4) & ((x1) ^ (x5)) ^ (x3) & (x5) ^ (x0)) */
    #define f_2(x6, x5, x4, x3, x2, x1, x0) \
        (((x4&x5)|x2) ^ (x0|x2) ^ x2&(x1&(~x3)^x6) ^ x3&x5 ^ x1&x4)

    /* #define f_4(x6, x5, x4, x3, x2, x1, x0) \
           ((x4) & ((x5) & ~(x2) ^ (x3) & ~(x6) ^ (x1) ^ (x6) ^ (x0)) ^ \
            (x3) & ((x1) & (x2) ^ (x5) ^ (x6)) ^ \
            (x2) & (x6) ^ (x0)) */
    #define f_4(x6, x5, x4, x3, x2, x1, x0) \
        ((((~x2&x5)^(x3|x6)^x1^x0)&x4) ^ ((x1&x2^x5^x6)&x3) ^ (x2&x6) ^ x0)

    /* #define f_5(x6, x5, x4, x3, x2, x1, x0) \
           ((x0) & ((x1) & (x2) & (x3) ^ ~(x5)) ^ \
            (x1) & (x4) ^ (x2) & (x5) ^ (x3) & (x6)) */
    #define f_5(x6, x5, x4, x3, x2, x1, x0) \
        ((((x0&x2&x3)^x4)&x1) ^ ((x0^x2)&x5) ^ (x3&x6) ^ x0)

Wei Dai
participants (3)
- Deranged Mutant
- lull@acm.org
- Wei Dai