[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Dr. Dobbs Dev. Update 1/5 July 94 & Schneier
> You can do better than a bit-serial loop -- though not down to
> one instruction! There are a lot of very cool approaches, only
> one of which I remember.
Bit counting was discussed in great detail in comp.lang.c in October
1990. I saved an excellent summary by Chris Torek, which I can post if
there is interest. It includes a program to test 17 different methods
of bit counting, and a table of results from six machine/compiler
combinations.
In 5 of the 6 tested environments, the fastest method for counting the
1's in a 32-bit word turned out to be some variant of a table lookup
(but not always the same variant). In 1 of the 6 tested environments,
the fastest code was the following, which is similar to that posted here
by Eli Brandt:
/*
* Explanation:
* First we add 32 1-bit fields to get 16 2-bit fields.
* Each 2-bit field is one of 00, 01, or 10 (binary).
* We then add all the two-bit fields to get 8 4-bit fields.
* These are all one of 0000, 0001, 0010, 0011, or 0100.
*
* Now we can do something different, becuase for the first
* time the value in each k-bit field (k now being 4) is small
* enough that adding two k-bit fields results in a value that
* still fits in the k-bit field. The result is four 4-bit
* fields containing one of {0000,0001,...,0111,1000} and four
* more 4-bit fields containing junk (sums that are uninteresting).
* Pictorially:
* n = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh
* n>>4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg
* sum = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ
* where W, X, Y, and Z are the interesting sums (each at most 1000,
* or 8 decimal). Masking with 0x0f0f0f0f extracts these.
*
* Now we can change tactics yet again, because now we have:
* n = 0000WWWW0000XXXX0000YYYY0000ZZZZ
* n>>8 = 000000000000WWWW0000XXXX0000YYYY
* so sum = 0000WWWW000ppppp000qqqqq000rrrrr
* where p and r are the interesting sums (and each is at most
* 10000, or 16 decimal). The sum `q' is junk, like i, j, and
* k above; but it is not necessarry to discard it this time.
* One more fold, this time by sixteen bits, gives
* n = 0000WWWW000ppppp000qqqqq000rrrrr
* n>>16 = 00000000000000000000WWWW000ppppp
* so sum = 0000WWWW000ppppp000sssss00tttttt
* where s is at most 11000 and t is it most 100000 (32 decimal).
*
* Now we have t = r+p = (Z+Y)+(X+W) = ((h+g)+(f+e))+((d+c)+(b+a)),
* or in other words, t is the number of bits set in the original
* 32-bit longword. So all we have to do is return the low byte
* (or low 6 bits, but `low byte' is typically just as easy if not
* easier).
*
* This technique is also applicable to 64 and 128 bit words, but
* 256 bit or larger word sizes require at least one more masking
* step.
*/
int
tG_sumbits(n)
register unsigned long n;
{
n = (n & 0x55555555) + ((n >> 1) & 0x55555555);
n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
n = (n + (n >> 4)) & 0x0f0f0f0f;
n += n >> 8;
n += n >> 16;
return (n & 0xff);
}
--apb (Alan Barrett)