Norm Hardy writes:
The PowerPC floating point unit is even more impressive. The fmadd instruction can do "a <- b*c + d" every other clock, or 30 per microsecond on the low-end Power Mac. If we store 24 bits of a multiple-precision number in successive elements of an array, then the inner loop of a multiply is a routine such as:
void m8(float *a, float *b, double *p) {
    p[0]  = a[0]*b[0];
    p[1]  = a[0]*b[1] + a[1]*b[0];
    p[2]  = a[0]*b[2] + a[1]*b[1] + a[2]*b[0];
    p[3]  = a[0]*b[3] + a[1]*b[2] + a[2]*b[1] + a[3]*b[0];
    p[4]  = a[0]*b[4] + a[1]*b[3] + a[2]*b[2] + a[3]*b[1] + a[4]*b[0];
    p[5]  = a[0]*b[5] + a[1]*b[4] + a[2]*b[3] + a[3]*b[2] + a[4]*b[1] + a[5]*b[0];
    ....
    p[13] = a[6]*b[7] + a[7]*b[6];
    p[14] = a[7]*b[7];
}
Nice hack, Norm. This would appear to apply to any processor where floating-point performance is substantially greater than integer performance. That is true of the Pentium too.

Floating point (latency/throughput):
    FADD  3/1
    FMUL  3/1
    FLD   1/1
    FST   2/2  (1/1 if storing to the FPU stack)

Integer (latency):
    ADD   1
    MUL   10