[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: IDEA vs DES



At 02:09 1994/08/11 -0400, Jeremiah A Blatz wrote:
....
>PowerPC integer performance is rather impressive, i.e. faster than
>Pentium by a bit. One craveat, tho, Apple says "No!" to programming in
>assembly, and I doubt that IBM is all this happy about it either. My
>guess is that MacOS is approaching the Unix "distribute source, 'cause
>you're gonna have to do lots of re-compiles" type of thing. Just a
>guess, though. Anyway, there is one assembly interpreter out for
>PowerMacs, I don't know about the IBM PowerPCs, though.

The PowerPC floating point is even more impressive. The fmadd instruction
can do "a <- b*c+d" every other clock or 30 per microsecond on the low end
Power Mac. If we store 24 bits of a multiple precision number in successive
elements of an arrary then the inner loop of a multiply is a routine such
as:

void m8(float * a, float * b, double * p)
{p[0] = a[0]*b[0];
p[1] = a[0]*b[1] + a[1]*b[0];
p[2] = a[0]*b[2] + a[1]*b[1] + a[2]*b[0];
p[3] = a[0]*b[3] + a[1]*b[2] + a[2]*b[1] + a[3]*b[0];
p[4] = a[0]*b[4] + a[1]*b[3] + a[2]*b[2] + a[3]*b[1] + a[4]*b[0];
p[5] = a[0]*b[5] + a[1]*b[4] + a[2]*b[3] + a[3]*b[2] + a[4]*b[1] + a[5]*b[0];
....
p[13] = a[6]*b[7] + a[7]*b[6];
p[14] = a[7]*b[7];}

The overhead consisting of loads and stores can largely be hidden since the
601 can issue both a floating point and fixed point instruction in a single
clock.  1000 bit numbers can thus be multiplied in (1000/24)^2
(1/30,000,000MHz) = 59 microseconds. The outer loop is also significant but
I would expect that it can be done in under 100 microseconds. Modular
exponentiation of 1000 bit numbers should take about 2*(1000/24)^3
(1/30,000,000MHz) = 2.5 ms without outer loop overhead.

The MPW compiler from Apple doesn't compile this code well and I may have
to write it in Assembler. The documentation that comes with MPW does not
discourage assembler and MPW (from Apple) includes a great assembler!

In another context I wrote some C code that compiles some optimized 601
machine code (to move pixels fast) and executes it. You don't need no
stinking assembler.