
On Sat, 20 Jan 1996, John Lull wrote:
I didn't see your original post, but did see Deranged's response. I would be interested to see whatever you come up with.
I ended up doing it like this: for (i=0; i<4; i++) { FF_42(t7, t6, t5, t4, t3, t2, t1, t0, w[wi2[8*i+0]], mc2[8*i+0]); FF_42(t6, t5, t4, t3, t2, t1, t0, t7, w[wi2[8*i+1]], mc2[8*i+1]); FF_42(t5, t4, t3, t2, t1, t0, t7, t6, w[wi2[8*i+2]], mc2[8*i+2]); FF_42(t4, t3, t2, t1, t0, t7, t6, t5, w[wi2[8*i+3]], mc2[8*i+3]); FF_42(t3, t2, t1, t0, t7, t6, t5, t4, w[wi2[8*i+4]], mc2[8*i+4]); FF_42(t2, t1, t0, t7, t6, t5, t4, t3, w[wi2[8*i+5]], mc2[8*i+5]); FF_42(t1, t0, t7, t6, t5, t4, t3, t2, w[wi2[8*i+6]], mc2[8*i+6]); FF_42(t0, t7, t6, t5, t4, t3, t2, t1, w[wi2[8*i+7]], mc2[8*i+7]); } This allows all the transform functions to fit into L1 cache, but at a cost. Besides the overhead of the for loop, each macro call now does two extra table lookups (in wi2 and mc2). The net result is a ~100% speedup over the reference implementation. Also, FYI, the boolean functions used in the reference implementation can be optimized. Thanks to Deranged Mutant for these: /* #define f_2(x6, x5, x4, x3, x2, x1, x0) \ ((x2) & ((x1) & ~(x3) ^ (x4) & (x5) ^ (x6) ^ (x0)) ^ \ (x4) & ((x1) ^ (x5)) ^ (x3) & (x5) ^ (x0)) */ #define f_2(x6, x5, x4, x3, x2, x1, x0) \ (((x4&x5)|x2) ^ (x0|x2) ^ x2&(x1&(~x3)^x6) ^ x3&x5 ^ x1&x4) /* #define f_4(x6, x5, x4, x3, x2, x1, x0) \ ((x4) & ((x5) & ~(x2) ^ (x3) & ~(x6) ^ (x1) ^ (x6) ^ (x0)) ^ \ (x3) & ((x1) & (x2) ^ (x5) ^ (x6)) ^ \ (x2) & (x6) ^ (x0)) */ #define f_4(x6, x5, x4, x3, x2, x1, x0) \ ((((~x2&x5)^(x3|x6)^x1^x0)&x4) ^ ((x1&x2^x5^x6)&x3) ^ (x2&x6) ^ x0) /* #define f_5(x6, x5, x4, x3, x2, x1, x0) \ ((x0) & ((x1) & (x2) & (x3) ^ ~(x5)) ^ \ (x1) & (x4) ^ (x2) & (x5) ^ (x3) & (x6)) */ #define f_5(x6, x5, x4, x3, x2, x1, x0) \ ((((x0&x2&x3)^x4)&x1) ^ ((x0^x2)&x5) ^ (x3&x6) ^ x0) Wei Dai