2) If you throw TCP processing in there, unless you are consistantly going to have packets on the order of at least 1000 bytes, your crypto algorithm is almost _irrelevant_.
This is my experience, too. And I would add "and lots of packets". The only crypto "overhead" that really mattered in a real application was the number of round-trip times it took to negotiate protocols and keys. Crypto's CPU time is very very seldom the limiting factor in real end-user application performance.
Could the lack of support for TCP offload in Linux have skewed these figures somewhat? It could be that the caveat for the results isn't so much "this was done ten years ago" as "this was done with a TCP stack that ignores the hardware's advanced capabilities".
I have never seen a network card or chip whose "advanced capabilities" included the ability to speed up TCP. Most such "advanced" designs actually ran slower than merely doing TCP in the Linux kernel using an uncomplicated chip. I saw a Patent Office procurement of Suns in the '80s that demanded these slow "TCP offload" boards (I had to write the bootstrap code for the project) even though the motherboard came with an Ethernet chip and software stack that could run TCP *at wire speed* all day and night -- for free. The super whizzo board couldn't even send back-to-back packets, as I recall. Some government contractor had added the "TCP offload" requirement, presumably to inflate the price that they were adding a percentage markup to. As a crypto-relevant aside, last year I looked at using the "crypto offload" engine in the AMD Geode cpu chip to speed up Linux crypto operations in the OLPC. There was even a nice driver for it. Summary: useless. It had been designed by somebody who had no idea of the architecture of modern software. The crypto engine used DMA "for speed", used physical rather than virtual addresses, and stored the keys internally in its registers -- so it couldn't work with virtual memory, and couldn't conveniently be shared between two different processes. It was SO much faster to do your crypto "by hand" in a shared library in a user process, than to cross into the kernel, copy the data to be in contiguous memory locations (or manually translate the addresses and lock down those pages into physical memory), copy the keys and IVs into the accelerator, do the crypto, copy the results back into virtual memory, and reschedule the user process. In typical applications (which don't always use the same key) you'd need to do this dance once for every block encrypted, or perhaps if you were lucky, for every packet. Even kernel crypto wasn't worth doing through the thing. And the software libraries were not only faster, they were also portable, running on anything, not just one obsolete chip. Hardware guys are just jerking off unless they spend a lot of time with software guys AT THE DESIGN STAGE before they lay out a single gate. One stupid design decision can take away all the potential gain. Every TCP offloader I've seen has had at least one. John --------------------------------------------------------------------- The Cryptography Mailing List Unsubscribe by sending "unsubscribe cryptography" to majordomo@metzdowd.com ----- End forwarded message ----- -- Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
participants (1)
-
John Gilmore