Re: Dr. Dobbs Dev. Update 1/5 July 94 & Schneier
An interesting thought hit me when reading this. The "classic" Cray series (Cray-1, X-MP, Y-MP) all have a rather curious instruction generally known as population count. All it does is to take a register and count the number of one bits in it, and return that count. ... Just a thought. It's the only plausable use that I have yet thought of for this instruction. Has anyone else got any ideas?
This instruction would be useful in all sorts of applications. I was just wishing I had such a thing only last week. I had to write a little loop to check the number of bits set in a word. Each bit represented an action, and in my particular case it was an error if more than 1 action was requested. The loop was really a waste when you consider that it could have been done in 1 instruction. tw
From: tim werner <werner@mc.ab.com> The loop was really a waste when you consider that it could have been done in 1 instruction.
You can do better than a bit-serial loop -- though not down to one instruction! There are a lot of very cool approaches, only one of which I remember. Look at the problem as that of finding the sum of n 1-bit blocks. Well, we can easily find the sum of a single n-bit block. The intermediate conversions are the magic part. Let's look at an 8-bit word. How shall we get, for example, from a sum of 4 2-bit blocks to a sum of 2 4-bit blocks? What we do is add adjacent blocks. The block-pair sums will actually fit in three bits, so they'll certainly fit in four without overflowing. And all of this can be done bit-parallel using logic ops. In C, this looks like: int byte_ones(int a) // hope this is correct... { a = (a & 0x55) + (a & 0xAA)/2; // 0x55 == 01010101b a = (a & 0x33) + (a & 0xCC)/4; // 0x33 == 00110011b a = (a & 0x0F) + (a & 0xF0)/16; // 0x0F == 00001111b return a; } Oh, and one AND in the third line is superfluous. This is not the fastest algorithm for this, but it's the only one I understand and remember. Eli ebrandt@hmc.edu (I won't ask why you needed a one-hot encoding in the first place...)
Excerpts from internet.cypherpunks: 3-Jul-94 Re: Dr. Dobbs Dev. Update 1.. by Eli Brandt@jarthur.cs.hm
int byte_ones(int a) // hope this is correct... { a = (a & 0x55) + (a & 0xAA)/2; // 0x55 == 01010101b a = (a & 0x33) + (a & 0xCC)/4; // 0x33 == 00110011b a = (a & 0x0F) + (a & 0xF0)/16; // 0x0F == 00001111b e> return a; }
Note that some compilers might not be smart enough to use logical shift ops and instead use expensive division ops. Just to be safe... int byte_ones(int a) { a = (a & 0x55) + ((a & 0xAA) << 1); // 0x55 == 01010101b a = (a & 0x33) + ((a & 0xCC) << 2); // 0x33 == 00110011b a = (a & 0x0F) + ((a & 0xF0) << 4); // 0x0F == 00001111b return a; } And this runs in O(lg n) where n is the number of bits in `a'. Does anybody have an algorithm for this that beats O(lg n)? _____________________________________________________________________________ Tim Nali \ "We are the music makers, and we are the dreamers of tn0s@andrew.cmu.edu \ the dreams" -Willy Wonka and the Chocolate Factory
Note that some compilers might not be smart enough to use logical shift ops and instead use expensive division ops. Just to be safe...
int byte_ones(int a) { a = (a & 0x55) + ((a & 0xAA) << 1); // 0x55 == 01010101b a = (a & 0x33) + ((a & 0xCC) << 2); // 0x33 == 00110011b a = (a & 0x0F) + ((a & 0xF0) << 4); // 0x0F == 00001111b return a; }
One advantage of writing it as division is that it's hard to accidentally reverse, as above. :-) I was just trying to cut down on parens... Eli ebrandt@hmc.edu
You can do better than a bit-serial loop -- though not down to one instruction! There are a lot of very cool approaches, only one of which I remember.
Bit counting was discussed in great detail in comp.lang.c in October 1990. I saved an excellent summary by Chris Torek, which I can post if there is interest. It includes a program to test 17 different methods of bit counting, and a table of results from six machine/compiler combinations. In 5 of the 6 tested environments, the fastest method for counting the 1's in a 32-bit word turned out to be some variant of a table lookup (but not always the same variant). In 1 of the 6 tested environments, the fastest code was the following, which is similar to that posted here by Eli Brandt: /* * Explanation: * First we add 32 1-bit fields to get 16 2-bit fields. * Each 2-bit field is one of 00, 01, or 10 (binary). * We then add all the two-bit fields to get 8 4-bit fields. * These are all one of 0000, 0001, 0010, 0011, or 0100. * * Now we can do something different, becuase for the first * time the value in each k-bit field (k now being 4) is small * enough that adding two k-bit fields results in a value that * still fits in the k-bit field. The result is four 4-bit * fields containing one of {0000,0001,...,0111,1000} and four * more 4-bit fields containing junk (sums that are uninteresting). * Pictorially: * n = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh * n>>4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg * sum = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ * where W, X, Y, and Z are the interesting sums (each at most 1000, * or 8 decimal). Masking with 0x0f0f0f0f extracts these. * * Now we can change tactics yet again, because now we have: * n = 0000WWWW0000XXXX0000YYYY0000ZZZZ * n>>8 = 000000000000WWWW0000XXXX0000YYYY * so sum = 0000WWWW000ppppp000qqqqq000rrrrr * where p and r are the interesting sums (and each is at most * 10000, or 16 decimal). The sum `q' is junk, like i, j, and * k above; but it is not necessarry to discard it this time. * One more fold, this time by sixteen bits, gives * n = 0000WWWW000ppppp000qqqqq000rrrrr * n>>16 = 00000000000000000000WWWW000ppppp * so sum = 0000WWWW000ppppp000sssss00tttttt * where s is at most 11000 and t is it most 100000 (32 decimal). * * Now we have t = r+p = (Z+Y)+(X+W) = ((h+g)+(f+e))+((d+c)+(b+a)), * or in other words, t is the number of bits set in the original * 32-bit longword. So all we have to do is return the low byte * (or low 6 bits, but `low byte' is typically just as easy if not * easier). * * This technique is also applicable to 64 and 128 bit words, but * 256 bit or larger word sizes require at least one more masking * step. */ int tG_sumbits(n) register unsigned long n; { n = (n & 0x55555555) + ((n >> 1) & 0x55555555); n = (n & 0x33333333) + ((n >> 2) & 0x33333333); n = (n + (n >> 4)) & 0x0f0f0f0f; n += n >> 8; n += n >> 16; return (n & 0xff); } --apb (Alan Barrett)
participants (4)
-
Alan Barrett -
Eli Brandt -
tim werner -
Timothy L. Nali