fast 386 DES code figures
To see if software DES could really be made acceptable in an IP security protocol, I've been bumming cycles out of my old DES code. I've completely translated the encrypt and decrypt routines to assembler, with no calls or jumps inside either routine. I picked up Richard Outerbridge's seriously clever initial and final permutation algorithm from Schneier, along with a few of his other tricks.

The bottom line: about 38,373 encryptions/sec (2.456 megabits/sec) on a 50 MHz Intel 486 running in 16-bit real mode. This includes the overhead of the C loop that calls the encrypt function and prints a status line every 10,000 loops. The code would probably run faster if assembled and run in 32-bit native mode, since that would eliminate a lot of 1-clock operand-size prefixes (I do many 32-bit operations).

Oh, by the way, if I eliminate the permutations the speed goes up to about 42,986 encryptions/sec (2.751 megabits/sec), an increase of about 12%. That says I should be able to do triple DES at about 13,777 blocks/sec (881.7 kbit/sec), although I haven't tried it yet.

What still bugs me is that Schneier lists the speed of one commercial DES implementation as 40,600 encryptions/sec on a 33 MHz 486. I just don't see how that's possible without using a lot more memory for lookup-table space (I use only 2K, which is nice in a DOS environment).

In any event, this should be enough for a T1 link (half duplex), as long as too many cycles aren't needed for things like routing packets. :-)

Phil
Phil Karn wonders where all the speed comes from in reports of fast software DES. I believe that the really fast DES variants use extremely large S-box tables computed at key-init time. As I recall, these implementations tend to pay for it in setup time, which makes them less than completely appropriate for multiple IP encryption contexts, each with its own key, where only a few dozen encryptions are done per packet. The cost to change keys is paid either in memory for multiple precomputed S-box sets (and the attendant swapping) or in a high ratio of key setup to encryption.

For a link cipher where the key doesn't change much, these fast implementations are right. For a situation where keys change frequently, they may not be a system win. Thanks to Perry Metzger for alerting me to this issue.

Eric
Phil Karn <karn@unix.ka9q.ampr.org> writes:
I've completely translated the encrypt and decrypt routines to assembler, with no calls or jumps inside either routine. I picked up Richard Outerbridge's seriously clever initial and final permutation algorithm from Schneier, along with a few of his other tricks.
I should confess that I am probably the only person on the list who has not yet read Schneier. So I apologize in advance if the following comments turn out to be redundant.
What still bugs me is that Schneier lists the speed of one commercial DES implementation as 40,600 encryptions/sec on a 33 MHz 486. I just don't see how that's possible without using a lot more memory for lookup table space (I use only 2K, which is nice in a DOS environment).
Since 2k is exactly what is needed for a precomputed table which combines the S-boxes and the wirecrossing, I will assume this is the approach you used. Given this data structure, there are a number of cute tricks which will get DES down to around 30 machine instructions for each of the 16 rounds on a machine with enough registers and a decent set of addressing modes.

The important trick is to reorder the S-boxes so that you do lookups on the odd-numbered ones and the even-numbered ones separately. (1,3,5,7,2,4,6,8) works nicely. This permits the results to be ORed together in two groups of four, with all the necessary indexing held in a single 32-bit register, which can be appropriately repositioned each time. The precomputed key schedule needs to be adjusted to reflect the new order. Note that with this ordering, the blocks of six bits used for lookup are byte-aligned if you consider the even and odd S-boxes separately.

If you store the upper two bits of lookup-table addressing in the precomputed key schedule and shift both it and the right-hand block left two bits, all explicit table indexing vanishes and you can accumulate the result of a lookup with a single indexed OR instruction.

I'm not sure what 30-something instructions per round translates into for a 33 MHz 486, but 40,600 encryptions per second doesn't sound too outrageous using the above approach.

-- Mike Duvos $ PGP 2.6 Public Key available $ mpd@netcom.com $ via Finger. $
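The odd/even grouping Mike describes can be sketched in C (rather than 486 assembler). This is an illustrative skeleton, not anyone's actual code: `SP[i]` stands for the combined S-box + P-permutation table for S-box i+1 (eight tables of 64 32-bit entries = 2 KB), and the entries are left as zero placeholders that a real implementation would fill from the DES S-boxes:

```c
#include <stdint.h>

/* Combined S-box + wirecrossing tables: 8 x 64 x 4 bytes = 2 KB.
 * Placeholder contents; real values come from the DES S-boxes
 * composed with the P permutation. */
static uint32_t SP[8][64];

/* One f-function evaluation in the reordered (1,3,5,7,2,4,6,8) style:
 * two passes, each driven by a single 32-bit work register whose four
 * byte-aligned 6-bit fields index four tables.  subkey[0]/subkey[1]
 * are the two precomputed 32-bit subkey halves, already arranged to
 * match this ordering. */
static uint32_t f(uint32_t r, const uint32_t subkey[2]) {
    uint32_t work, out = 0;

    /* Odd S-boxes 1,3,5,7: rotate R right 4 so their inputs align. */
    work = ((r >> 4) | (r << 28)) ^ subkey[0];
    out |= SP[6][ work        & 0x3f];   /* S-box 7 */
    out |= SP[4][(work >>  8) & 0x3f];   /* S-box 5 */
    out |= SP[2][(work >> 16) & 0x3f];   /* S-box 3 */
    out |= SP[0][(work >> 24) & 0x3f];   /* S-box 1 */

    /* Even S-boxes 2,4,6,8: no rotate needed. */
    work = r ^ subkey[1];
    out |= SP[7][ work        & 0x3f];   /* S-box 8 */
    out |= SP[5][(work >>  8) & 0x3f];   /* S-box 6 */
    out |= SP[3][(work >> 16) & 0x3f];   /* S-box 4 */
    out |= SP[1][(work >> 24) & 0x3f];   /* S-box 2 */

    return out;
}
```

Each pass touches four tables with nothing but mask-and-index operations on one register, which is where the ~30 instructions per round come from.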
Since 2k is exactly what is needed for a precomputed table which combines the S-boxes and the wirecrossing, I will assume this is the approach you used.
Yup, it is. I could look up more than 6 bits (i.e., more than 1 S-box) at a time, but this really starts to eat RAM.
The important trick is to reorder the S-boxes so that you do lookups on the odd numbered ones and the even numbered ones separately. (1,3,5,7,2,4,6,8) works nicely. [...]
This is another trick from Outerbridge's code that I picked up. As you say, it does make a difference. It's especially nice in 386 assembler since I can do the key XOR E(R) AND mask in 32-bit operations, then pick off the 4 resulting bytes individually to do the SP box indexing. This trick took me from about 1.85 megabits/sec to the 2.45 megabit/sec figure I gave earlier.
If you store the upper two bits of lookup table addressing in the precomputed key schedule and shift both it and the right hand block left two bits, all explicit table indexing vanishes and you can accumulate the result of a lookup with a single indexed OR instruction.
I'm doing this too, if I understand you correctly. By left-adjusting each subkey in the key schedule (i.e., shifting the 6 bits left 2 bits), I can pre-adjust for the x4 offset I need to index the SP table, which has 4-byte elements. This saves two 32-bit shifts per round.

BTW, some of the code (including Outerbridge's in Schneier) accumulates the 8 intermediate SP results by ORing into a temporary, then XORs the temporary into the output data block. This is unnecessary; each table lookup can be XORed directly into the output block. Since XOR and OR take the same time, this avoids a temporary and an extra operation.

At the moment I'm really down in the noise. I've discovered that 286/386/486-specific instructions like ROR EAX,31 execute slightly faster (2 clock cycles) on the 486 than the equivalent 8086-style instruction ROL EAX,1 (3 clock cycles), even though the faster instruction takes more bytes. Unexpected timings occur for several other 486 instruction sequences as well, such as LODS[BW] (5 clocks), which is much slower than writing out the equivalent MOV/INC (or ADD) sequence longhand (1 clock each). I guess code size is unimportant as long as everything lands in the cache.

Phil
participants (3):
- hughes@ah.com
- mpd@netcom.com
- Phil Karn