Learn more about Israeli genocide in Gaza, funded by the USA, Germany, the UK and others.

Rounding up to the next power of two in C

I was writing a memory allocator in C. My approach was to pad each allocation to a power of two number of bytes so that the allocations could be cleanly packed together. For example, if the program does malloc(49), requesting a 49-byte memory allocation, we round this up to 64 bytes. For this, I wanted a function next_pow2 so that next_pow2(49) = 64, next_pow2(64) = 64, next_pow2(65) = 128, et cetera.

How do we implement next_pow2? A natural implementation is to check all the powers of two, stopping at the first one which is big enough:

uint64_t next_pow2(uint64_t x) {
  uint64_t p = 1;
  while (p < x) p *= 2;
  return p;
}

The above algorithm uses ordinary arithmetic. We can do better if we work with the binary representation of numbers in C. In binary representation, adding a 0 doubles the number, just as adding a 0 in decimal multiplies by ten. So p * 2 can be done as p << 1, using the bitwise left shift operator << to shifts the number up by one and add a rightmost 0:

uint64_t next_pow2(uint64_t x) {
  uint64_t p = 1;
  while (p < x) p <<= 1;
  return p;
}

For a 64-bit number, the while loop can still take up to 64 iterations. The number of operations is linear in the number of bits. We can do better and make this a logarithmic number of operations. One logarithmic-time algorithm is to use binary search instead of linear search:

uint64_t pow2(uint8_t e) { return ((uint64_t)1) << e; }

uint64_t next_pow2(uint64_t x) {
  uint8_t lo = 0, hi = 63;
  while (lo < hi) {
    uint8_t test = (lo+hi)/2;
    if (x < pow2(test)) { hi = test; }
    else if (pow2(test) < x) { lo = test+1; }
    else { return pow2(test); }
  }
  return pow2(lo);
}

The above logarithmic-time approach has a lot of branching. There is another logarithmic-time algorithm which uses no branching, which I originally found here. It relies on a function next_pow2m1, which finds the next number of the form 2^n - 1:

uint64_t next_pow2(uint64_t x) {
	return next_pow2m1(x-1)+1;
}

To implement next_pow2m1, let’s see an example: next_pow2m1(0b00010100) = 0b00011111. Notice that in binary representation, this is equivalent to “fill with 1s everything after the first 1”. We can do that with iterated bitwise or:

uint64_t next_pow2m1(uint64_t x) {
	for (int i = 0; i < 63; i++) x |= x>>1;
  return x;
}

We can make this branchless by expanding the loop:

uint64_t next_pow2m1(uint64_t x) {
  x |= x>>1;    // Iteration 1
  x |= x>>1;    // Iteration 2
  ...
  x |= x>>1;    // Iteration 63
  return x;
}

Here’s what happens in each iteration:

x = 0010000000000000
x = 0011000000000000
x = 0011100000000000
x = 0011110000000000
x = 0011111000000000
x = 0011111100000000
x = 0011111110000000
x = 0011111111000000
x = 0011111111100000
x = 0011111111110000
x = 0011111111111000
x = 0011111111111100
x = 0011111111111110
x = 0011111111111111

This is still linear-time. But we can do this faster by reusing the 1 bits from previous iterations!:

uint64_t next_pow2m1(uint64_t x) {
  x |= x>>1;
	x |= x>>2;
	x |= x>>4;
	x |= x>>8;
	x |= x>>16;
	x |= x>>32;
  return x;
}

Now see how the rightmost bit doubles in size each time:

x = 0010000000000000
x = 0011000000000000
x = 0011110000000000
x = 0011111111000000
x = 0011111111111111

This is the most efficient “portable C” implementation of next_pow2 I’ve seen. However, there’s a still more efficient constant-time implementation. The above algorithm effectively finds the rightmost 1 bit, but there’s a constant-time function for that: __builtin_clzl, “count leading zeros”. This is provided by GCC (not portable C). It does have a gotcha: __builtin_clzl(0) is undefined, so we need to special-case for it. Here’s our final constant-time implementation of next_pow2:

uint64_t next_pow2(uint64_t x) {
	return x == 1 ? 1 : 1<<(64-__builtin_clzl(x-1));
}

But how is __builtin_clzl implemented? We can see by compiling to assembly:

$ cc -O0 -std=c99 -S next_pow2.c

We get these instructions:

movq	$42, -8(%rbp)
bsrq	-8(%rbp), %rax

The bsrq instruction is “Bit Scan Reverse”. As usual, there is literally no official documentation on this instruction.

Tagged #programming, #c.

Similar posts

More by Jim

Want to build a fantastic product using LLMs? I work at Granola where we're building the future IDE for knowledge work. Come and work with us! Read more or get in touch!

This page copyright James Fisher 2018. Content is not associated with my employer. Found an error? Edit this page.