At 02:09 1994/08/11 -0400, Jeremiah A Blatz wrote: ....
PowerPC integer performance is rather impressive, i.e. faster than Pentium by a bit. One caveat, though: Apple says "No!" to programming in assembly, and I doubt that IBM is all that happy about it either. My guess is that MacOS is approaching the Unix "distribute source, 'cause you're gonna have to do lots of re-compiles" type of thing. Just a guess, though. Anyway, there is one assembly interpreter out for PowerMacs; I don't know about the IBM PowerPCs, though.
The PowerPC floating point is even more impressive. The fmadd instruction can do "a <- b*c + d" every other clock, or 30 per microsecond on the low-end Power Mac. If we store 24 bits of a multiple-precision number in successive elements of an array, then the inner loop of a multiply is a routine such as:

    void m8(float *a, float *b, double *p)
    {
        p[0]  = a[0]*b[0];
        p[1]  = a[0]*b[1] + a[1]*b[0];
        p[2]  = a[0]*b[2] + a[1]*b[1] + a[2]*b[0];
        p[3]  = a[0]*b[3] + a[1]*b[2] + a[2]*b[1] + a[3]*b[0];
        p[4]  = a[0]*b[4] + a[1]*b[3] + a[2]*b[2] + a[3]*b[1] + a[4]*b[0];
        p[5]  = a[0]*b[5] + a[1]*b[4] + a[2]*b[3] + a[3]*b[2] + a[4]*b[1] + a[5]*b[0];
        ....
        p[13] = a[6]*b[7] + a[7]*b[6];
        p[14] = a[7]*b[7];
    }

The overhead, consisting of loads and stores, can largely be hidden, since the 601 can issue both a floating-point and a fixed-point instruction in a single clock. 1000-bit numbers can thus be multiplied in (1000/24)^2 fmadds, i.e. (1000/24)^2 / (30 per microsecond) = about 58 microseconds. The outer loop is also significant, but I would expect that the whole multiply can be done in under 100 microseconds. Modular exponentiation of 1000-bit numbers should take about 2*(1000/24)^3 / (30 per microsecond) = about 4.8 ms, not counting outer-loop overhead. The MPW compiler from Apple doesn't compile this code well, and I may have to write it in assembler. The documentation that comes with MPW does not discourage assembler, and MPW (from Apple) includes a great assembler! In another context I wrote some C code that generates optimized 601 machine code (to move pixels fast) and executes it. You don't need no stinking assembler.
Norm Hardy writes:
The PowerPC floating point is even more impressive. The fmadd instruction can do "a <- b*c + d" every other clock, or 30 per microsecond on the low-end Power Mac. If we store 24 bits of a multiple-precision number in successive elements of an array, then the inner loop of a multiply is a routine such as:

    void m8(float *a, float *b, double *p)
    {
        p[0]  = a[0]*b[0];
        p[1]  = a[0]*b[1] + a[1]*b[0];
        p[2]  = a[0]*b[2] + a[1]*b[1] + a[2]*b[0];
        p[3]  = a[0]*b[3] + a[1]*b[2] + a[2]*b[1] + a[3]*b[0];
        p[4]  = a[0]*b[4] + a[1]*b[3] + a[2]*b[2] + a[3]*b[1] + a[4]*b[0];
        p[5]  = a[0]*b[5] + a[1]*b[4] + a[2]*b[3] + a[3]*b[2] + a[4]*b[1] + a[5]*b[0];
        ....
        p[13] = a[6]*b[7] + a[7]*b[6];
        p[14] = a[7]*b[7];
    }
Nice hack, Norm. This would appear to apply to any processor where the floating-point performance is substantially greater than the integer performance. This is true of the Pentium too.

    Floating point (latency/throughput):
        FADD  3/1
        FMUL  3/1
        FLD   1/1
        FST   2/2  (1/1 if storing to FPU stack)

    Integer (cycles):
        ADD   1
        MUL  10
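On any such FP-heavy processor, the unrolled m8() above generalizes to a plain loop whose kernel is the same multiply-accumulate shape the FPU pipelines well. A sketch under the same 24-bit-limb assumption; the function name is hypothetical:

```c
/* Schoolbook multiply of two n-limb numbers (24-bit limbs in floats),
   accumulating exact column sums in doubles.  p[] needs 2n-1 slots.
   The inner statement is exactly the b*c+d shape of fmadd. */
void mul_limbs(const float *a, const float *b, double *p, int n)
{
    for (int k = 0; k < 2 * n - 1; k++)
        p[k] = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            p[i + j] += (double)a[i] * b[j];  /* multiply-add kernel */
}
```

With n = 8 this computes the same column sums as the hand-unrolled m8(); the unrolling just removes loop overhead and exposes the independent multiply-adds to the scheduler.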
participants (2)
- Eric Blossom
- norm@netcom.com