Eric Young pointed out to me a factor of 8 error in my DES CBC block
timing calculations (I used megabytes, Eric's figures are in megabytes,
Peter's are in mega _bits_; this fact escaped my notice, though I must
say I was very impressed with Peter's optimisations :-).

Peter has since posted his cool keysched optimisations, which change
things also.

So here's the table revisited (extrapolated to a 133MHz Pentium from
Peter's 90MHz figures, to keep it in line with the previously posted
table):

        keysched = 3.8us
        DES cbc  = 7.3us

  n keys   |  key   |   DES    | elapsed for | elapsed per |
 at a time | sched  |   cbc    | n key tests |  key test   |   keys/sec
-----------------------------------------------------------------------
      1      3.4us     21.9us       25.3us       25.3us     39.5k keys/s
      2      3.4us     43.8us       47.2us       23.6us     42.4k keys/s
      4      3.4us     87.6us       91.0us       22.7us     44.0k keys/s
      8      3.4us    175.2us      178.6us       22.3us     44.8k keys/s
     16      3.4us    350.4us      353.8us       22.1us     45.2k keys/s
     32      3.4us    700.8us      704.2us       22.0us     45.4k keys/s
     64      3.4us   1401.6us     1405.0us       22.0us     45.6k keys/s
    128      3.4us   2803.2us     2806.6us       21.9us     45.6k keys/s

So as you can see this greatly reduces the gains to be made from
multiple keys. It's not worth doing more than 64 keys, and 64 keys only
buys a 15% increase in keys/sec.

When Peter adds what he calls the "glue" code in his paper (extra code
to move data, compare results etc), the advantage of multiple keys may
go down further.

Also, if the extra code for testing multiple keys pushes the code
requirements over the 8k L1 code cache, or the extra data space pushes
the data over the 8k L1 data cache, this may lose more than is gained.
The extra code complexity will add a small amount of overhead too.

The data requirements for multiple keys aren't that high. (Extra data
is number of keys x blocks required per test x 8 byte block size
= 64 x 3 x 8 = 1.5k, or 384 bytes if you restrict yourself to 16 keys
at once, and lose the last 1% gain.)
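For anyone who wants to re-run the arithmetic with their own timings,
here is a rough Python sketch of the model behind the table. It assumes
what the table rows assume: one 3.4us key schedule charged per batch,
and 3 CBC blocks at 7.3us each per key tested (3 x 7.3us = 21.9us). The
function and constant names are mine, purely for illustration; they are
not from Peter's or Eric's code.

```python
# Sketch of the timing model used in the table above (assumed, not measured).
KEYSCHED_US = 3.4      # key schedule cost per batch, as in the table rows
CBC_BLOCK_US = 7.3     # one DES CBC block
BLOCKS_PER_TEST = 3    # blocks per key test (3 x 7.3us = 21.9us)
BLOCK_BYTES = 8        # DES block size in bytes


def batch_timing(n_keys, keysched_us=KEYSCHED_US,
                 cbc_block_us=CBC_BLOCK_US, blocks=BLOCKS_PER_TEST):
    """Return (cbc_us, elapsed_us, per_key_us, keys_per_sec) for one batch."""
    cbc_us = n_keys * blocks * cbc_block_us
    elapsed_us = keysched_us + cbc_us        # one key schedule per batch
    per_key_us = elapsed_us / n_keys
    keys_per_sec = 1e6 / per_key_us
    return cbc_us, elapsed_us, per_key_us, keys_per_sec


def extra_data_bytes(n_keys, blocks=BLOCKS_PER_TEST, block_bytes=BLOCK_BYTES):
    """Extra block storage needed to test n_keys keys at a time."""
    return n_keys * blocks * block_bytes


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32, 64, 128):
        cbc, elapsed, per_key, rate = batch_timing(n)
        print(f"{n:4d} {cbc:9.1f}us {elapsed:9.1f}us {per_key:6.1f}us "
              f"{rate / 1e3:6.1f}k keys/s {extra_data_bytes(n):5d} bytes")
```

Changing KEYSCHED_US or CBC_BLOCK_US shows how quickly the benefit of
batching flattens out once the key schedule is small next to the CBC
work.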

Adam
--
print pack"C*",split/\D+/,`echo "16iII*o\U@{$/=$z;[(pop,pop,unpack"H*",<> )]}\EsMsKsN0[lN*1lK[d2%Sa2/d0<X+d*lMLa^*lN%0]dsXx++lMlN/dsM0<J]dsJxp"|dc`