In message <199408262009.NAA17046@unix.ka9q.ampr.org> Phil Karn writes:
Yes, but even scalar multiplication is so much faster on a DSP than on most general purpose CPUs that it seems like a definite win. The 486 takes from 13-42 clock cycles to perform a multiply, depending on the operand sizes and number of significant bits in the multiplier.
The Motorola DSP96002 does an integer multiply in 2 or 3 clocks, so a 33 MHz device does 11 million multiplies (and moves) a second. The chip costs about $50. The newer TI C40 does a 32-bit integer multipy in 1 clock, so a 50 MHz device can output 200 MB/s of results. It can read in a single clock cycle but writes take two cycles (sometimes more). So although it can theoretically read 200 MB/s, it can only write 100 MB/s. However, it has six serial links, each one of which has a 20 MB/s bandwidth, so in theory it can pump out 100+120 = 220 MB/s. However, in practice you would expect the chip to be I/O bound. It costs something like $200 a chip. The real advantage of the C40 is that C40s can be connected together using their serial links. This allows them to be arranged in interesting 3D topologies. In this respect the C40 is intended to be an upgrade on the transputer, which has only four links, and tends to die when connected into large 2D meshes, because the transputers spend too much of their time passing messages. If C40s are connected in pipelines, with three links used as input from the preceding stage and three links used to drive the next stage, you can run them comfortably at 60MB/s. You might choose to do three multiplies on each 32-bit operand at this rate, giving you effectively multiplications at 45 MHz at each stage of the pipeline.
Even if you couldn't keep the pipeline full on a chip like the PowerPC, you'd still be well ahead.
Ahead of the 486 maybe, but the C40 makes the PowerPC a dog.
But then I hear people say that it's not the multiplication that slows down modular exponentiation, it's the modular reduction.
Can you elaborate? -- Jim Dixon