Re: Counting bits

17 Dec 2003

      -----BEGIN PGP SIGNED MESSAGE-----
...
Just for entertainment value, I clipped your function and compiled it
with Turbo C++ 1.01 in default (ANSI C) mode.  Here's the .asm code
produced (comments and setup code edited for brevity)
Both Sun C and GCC on a Sun SPARC system running 4.1.3 produced this code
for each bit-count line (-O4 optimization used):

L77042:
        andcc   %o0,2,%g0		; AND the bit
        bne,a   L77044			; branch/anull if zero
        inc     %o5			; increment bitcount
L77044:

This, I believe, is as optimized as it is possible to get on a uniprocessor
machine.

On both compilers, the routine size was 28 instructions total, and that
would also be the maximum path length for the execution of this routine
when passed an ASCII 255 value.

A MIPS-based DECserver running Ultrix 7.1 produced this (again, -O4):

$34:
        lb      $11, 0($sp)		; Load the byte off the stack
        and     $12, $11, 16		; AND the bit
        beq     $12, 0, $35		; branch/anull if zero
        addu    $3, $3, 1		; increment bitcount
$35:

Total instruction count was 28.  This is non-optimal, as there is no
need to reload off the top of the stack on every line, and if so 
modified it would be equivalently efficient to the SPARC implementation.

On a Cray Y-MP/EL running UNICOS 7.0.6 (-O3, which is equivalent to
- -hinline3,scalar3,task3,vector3):

L5          =               P.*
            S7              2		; Move 2 into S7
            S0              S2&S7	; S0 = S2 AND S7
            JSZ             L6		; Jump to L6 if the bit was zero
            S7              1		; Move 1 into S7
            S1              S1+S7	; Up the bitcount in S1
L6          =               P.*                             ;               9

Note that the Cray C compiler (or indeed any C compiler I know of) is not
yet capable of recognising the option of using the population count
instruction here, because it is nearly impossible to determine what this
particular routine is doing.  Even so, the total instruction count is
80, which is somewhat excessive.  The "Move 1 into S7" could probably
be eliminated by using another scalar register, and I suspect (but don't
have the manual here so I cannot confirm) that they'd be better not
to reload the mask every line, but instead to load it once and shift.
Additionally, you could probably vectorise this, but I doubt it would
buy you much.

Anyway, that's an analysis of three high end architectures on this
code fragment.  Personally I feel that a lookup table would be a MUCH
more efficient implementation for most systems which lack population
count, even for words up to 20 bits or so in size (depending on your storage
requirements and latency at accessing main memory, of course).

Enjoy.  One of these days I will get back to my project of implementing
crypto primatives in CAL, but I do not have the time right now.
BTW, folks, playing around with this is fun.  I still believe that either
the SKIPJACK interim reports Cray-implementation timing figures were
wrong, or the conditions under which the program was compiled was
incorrect (most likely), or that SKIPJACK contains no s-boxes.
Take your pick.

						Ian.

-----BEGIN PGP SIGNATURE-----
Version: 2.3

iQCVAgUBLhukvdCZASdT8NoBAQHe/wQAzW/zmoiiAz9vswLO5kQcs6TSoAhIK7SM
1hTrvbXTbNwrnK2FyhC4nZaUPIjnZufOeCoQPs1DJNsCZ1q6Gx1nlVj/hTyBUxYr
THQ9ZLOUFruSDa18enx4J1iSrliBeoGcV0CuGRxClNoFrDkYedzRS0nN+m/rq35W
Vcsk0HFxq0g=
=Wpri
-----END PGP SIGNATURE-----

Re: Counting bits

Ian Farquhar