You can do better than a bit-serial loop -- though not down to one instruction! There are a lot of very cool approaches, only one of which I remember.
Bit counting was discussed in great detail in comp.lang.c in October 1990. I saved an excellent summary by Chris Torek, which I can post if there is interest. It includes a program to test 17 different methods of bit counting, and a table of results from six machine/compiler combinations. In 5 of the 6 tested environments, the fastest method for counting the 1's in a 32-bit word turned out to be some variant of a table lookup (but not always the same variant). In 1 of the 6 tested environments, the fastest code was the following, which is similar to that posted here by Eli Brandt: /* * Explanation: * First we add 32 1-bit fields to get 16 2-bit fields. * Each 2-bit field is one of 00, 01, or 10 (binary). * We then add all the two-bit fields to get 8 4-bit fields. * These are all one of 0000, 0001, 0010, 0011, or 0100. * * Now we can do something different, becuase for the first * time the value in each k-bit field (k now being 4) is small * enough that adding two k-bit fields results in a value that * still fits in the k-bit field. The result is four 4-bit * fields containing one of {0000,0001,...,0111,1000} and four * more 4-bit fields containing junk (sums that are uninteresting). * Pictorially: * n = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh * n>>4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg * sum = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ * where W, X, Y, and Z are the interesting sums (each at most 1000, * or 8 decimal). Masking with 0x0f0f0f0f extracts these. * * Now we can change tactics yet again, because now we have: * n = 0000WWWW0000XXXX0000YYYY0000ZZZZ * n>>8 = 000000000000WWWW0000XXXX0000YYYY * so sum = 0000WWWW000ppppp000qqqqq000rrrrr * where p and r are the interesting sums (and each is at most * 10000, or 16 decimal). The sum `q' is junk, like i, j, and * k above; but it is not necessarry to discard it this time. * One more fold, this time by sixteen bits, gives * n = 0000WWWW000ppppp000qqqqq000rrrrr * n>>16 = 00000000000000000000WWWW000ppppp * so sum = 0000WWWW000ppppp000sssss00tttttt * where s is at most 11000 and t is it most 100000 (32 decimal). * * Now we have t = r+p = (Z+Y)+(X+W) = ((h+g)+(f+e))+((d+c)+(b+a)), * or in other words, t is the number of bits set in the original * 32-bit longword. So all we have to do is return the low byte * (or low 6 bits, but `low byte' is typically just as easy if not * easier). * * This technique is also applicable to 64 and 128 bit words, but * 256 bit or larger word sizes require at least one more masking * step. */ int tG_sumbits(n) register unsigned long n; { n = (n & 0x55555555) + ((n >> 1) & 0x55555555); n = (n & 0x33333333) + ((n >> 2) & 0x33333333); n = (n + (n >> 4)) & 0x0f0f0f0f; n += n >> 8; n += n >> 16; return (n & 0xff); } --apb (Alan Barrett)