17 Dec
2003
11:17 p.m.
Actually, looking at the assembly generated by three different compilers (GCC on i386, GCC on PA, and HP's PA compiler), strangely enough, the `% 256' should be `& 0xff': it shaves a few instructions off the inner loop, for a reason that isn't immediately apparent to me. On the PA, I got a ~30% speedup by unrolling the inner loop 4x, assembling the pad into an `unsigned long', and doing one 4-byte-wide XOR with the user data. I think most of the speedup comes from giving the instruction scheduler more instructions to reorder to avoid load-store conflicts. Your mileage will vary on other architectures. - Bill
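
For anyone who wants to try this, here is a rough sketch of the two loop shapes. The original code isn't reproduced in this comment, so the pad generator (`pad_state', `pad_init', `pad_next') is a made-up stand-in and not the real cipher; only the `& 0xff' mask and the 4x unrolled XOR correspond to what is described above. One plausible explanation for the `% 256' cost, assuming the index was a signed int: the remainder of a negative value has to come out negative in C, so the compiler emits extra sign-fixup instructions that a plain mask doesn't need. The sketch uses `uint32_t' and `memcpy' rather than a cast to `unsigned long' to sidestep alignment and word-size differences; that is my choice, not necessarily what the original did.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical pad generator: a stand-in for illustration only,
 * NOT the cipher discussed in this thread. */
typedef struct {
    unsigned i;
    uint8_t table[256];
} pad_state;

static void pad_init(pad_state *s, uint8_t seed)
{
    s->i = 0;
    for (unsigned k = 0; k < 256; k++)
        s->table[k] = (uint8_t)(k * 7u + seed);
}

static uint8_t pad_next(pad_state *s)
{
    s->i = (s->i + 1) & 0xff;   /* mask instead of `% 256' */
    return s->table[s->i];
}

/* Baseline: one pad byte and one byte-wide XOR per iteration. */
static void xor_bytewise(pad_state *s, uint8_t *buf, size_t n)
{
    for (size_t k = 0; k < n; k++)
        buf[k] ^= pad_next(s);
}

/* Unrolled 4x: collect four pad bytes, then do a single 4-byte-wide XOR.
 * The pad bytes are laid out in memory order, so the result matches the
 * bytewise loop regardless of endianness. */
static void xor_unrolled4(pad_state *s, uint8_t *buf, size_t n)
{
    size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        uint8_t p[4];
        uint32_t pad, data;
        p[0] = pad_next(s);
        p[1] = pad_next(s);
        p[2] = pad_next(s);
        p[3] = pad_next(s);
        memcpy(&pad, p, sizeof pad);          /* assemble the pad word */
        memcpy(&data, buf + k, sizeof data);  /* alignment-safe load */
        data ^= pad;
        memcpy(buf + k, &data, sizeof data);  /* alignment-safe store */
    }
    for (; k < n; k++)                        /* leftover 0..3 bytes */
        buf[k] ^= pad_next(s);
}
```

Both functions should produce identical output for the same starting state; whether the unrolled one is actually faster depends on the compiler and the CPU, so measure before keeping it.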