2025-07-08 01:37
• 19+ min read • #c #branchless
-O3 -ffast-math -march=native -fomit-this-entire-function
) were taken.Modern CPUs are predictive creatures. They guess what you're about to do, like a nosy algorithm trying to sell you sneakers after you Googled "foot pain." Branch predictors make CPUs fast by speculating on branches... until they guess wrong and everything grinds to a halt for 15,20 cycles.
Branchless programming is how we get around this: we rewrite our code to not branch at all. Instead of jumping to conclusions, we manipulate bits like 1980s assembly gremlins.
What is a branch first of all?In a program, you may observe parts like this:
if (condition), then
...
elif (condition), then
...
else
...
fi
This block of instructions is a collection of three branches. Each if
, elif
, and else
represents a possible execution path the CPU can take depending on the evaluation of the conditions. At runtime, only one of these paths is taken, and the others are skipped. This choice , this deviation in the control flow , is what we call a branch.
In terms of machine code, a branch is often implemented as a jump (jmp
, je
, jne
, etc.). These are instructions that tell the CPU: “if this condition holds, skip to label X; otherwise, keep going.” That jump disrupts the nice linear stream of instruction execution, forcing the CPU to guess where you’re going next.
Here’s a simple ASCII representation of how this decision tree looks:
[Condition A]
|
+-----+------+
Yes No
| |
[Block A] [Condition B]
|
+-----+------+
Yes No
| |
[Block B] [Block C]
From a CPU perspective, each conditional check and potential jump is a “fork in the road.” If your code has a predictable pattern (e.g., always taking the same branch), the CPU can guess well and maintain performance. But if it's unpredictable, say, random data or user input, then the CPU may guess wrong, flush its pipeline, and pay a heavy penalty.
This is why branches can be so dangerous in tight loops or performance-critical code: even one mispredicted branch can cost dozens of cycles, ruining your cache-fueled dreams.
Branches, when predictable, are cheap. But when unpredictable, they're evil. Imagine a tight loop that checks a condition based on data from user input, or real-world sensors, or shuffled arrays. The branch predictor stumbles, and every misstep means flushing the pipeline , a costly affair on modern superscalar out-of-order CPUs.
Branchless code avoids that entirely. By rewriting conditional logic into arithmetic and bit operations, or using CPU instructions like cmov
, we let the CPU chew through code without pausing to guess. It’s smoother, faster, and often more deterministic, which is crucial in performance-critical or side-channel-resistant scenarios (looking at you, cryptography).
We're going to take you on a wild ride through three increasingly complex examples:
abs(x)
, a gentle warm-up with unary funclamp(x, min, max)
, a common pattern with two conditionspartition()
, a full algorithm with data-dependent control flowWe'll compare these in C (our performance-hungry workhorse), we'll show you how these concepts look in both worlds, how they perform.
Absolute value is your first ticket to understanding how to cut down branches without cutting performance.
We want to calculate the absolute value of a signed integer without using a conditional branch. This is foundational — a single-bit operation can turn a branch into math.
int abs_branch(int x) {
return x < 0 ? -x : x;
}
// Compiles to a `cmp` and a `jge` or `jl`, depending on compiler and optimization level.
int abs_branchless(int x) {
int mask = x >> 31;
return (x + mask) ^ mask;
}
mask
to x
either increments or doesn't.mask
flips bits only if mask
is -1.int abs_alt(int x) {
int mask = x >> 31;
return (x ^ mask) - mask;
}
Produces identical results; different taste of the same bit soup.
mov eax, edi ; move x into eax
sar eax, 31 ; sign-extend right shift to produce mask
mov ecx, eax ; duplicate mask
add edi, ecx ; edi = x + mask
xor eax, edi ; eax = result = (x + mask) ^ mask
Fast. No jumps. Pure ALU (arithmetic logic unit).
The clamp is more complex. You want to bound a value between a min
and max
.We want to ensure a value stays within [min, max]
without branches. This is key in physics simulations, rendering, and safety constraints.
int clamp(int x, int min, int max) {
if (x < min) return min;
if (x > max) return max;
return x;
}
int clamp_branchless(int x, int min, int max) {
int r1 = x - ((x - min) & ((x - min) >> 31));
return r1 - ((r1 - max) & ((r1 - max) >> 31));
}
(x - min) >> 31
creates a mask that's all 1s if x < min
.min
when necessary.sub eax, min
sar eax, 31 ; create mask_low
and eax, (x - min)
sub x, eax ; x = max(x, min)
sub x, max
sar ..., 31 ; create mask_high
and ..., (x - max)
sub x, ... ; x = min(x, max)
Every operation is ALU-based. No branching, just pure logic.This removes branches by computing masks and blending values accordingly. Elegant? No. Effective? Absolutely.
This is where branchless logic makes the biggest splash, in algorithms that iterate over data and make conditional swaps.Partition an array around a pivot such that all elements < pivot come before elements >= pivot, without any conditional branching in the inner loop.
void swap(int* a, int* b) {
int temp = *a;
*a = *b;
*b = temp;
}
int partition(int* arr, int low, int high) {
int pivot = arr[high];
int i = low;
for (int j = low; j < high; j++) {
if (arr[j] < pivot) {
swap(&arr[i], &arr[j]);
i++;
}
}
swap(&arr[i], &arr[high]);
return i;
}
int partition_branchless(int* arr, int low, int high) {
int pivot = arr[high];
int i = low;
for (int j = low; j < high; j++) {
swap(&arr[i], &arr[j]);
i += arr[i] < pivot;
}
swap(&arr[i], &arr[high]);
return i;
}
i += ((arr[i] - pivot) >> 31) & 1;
Relies on arithmetic right shift and masking to conditionally increment.
mov eax, [arr+i*4]
cmp eax, pivot
setl bl
add i, ebx
Or:
cmp eax, pivot
adc i, 0
Clever use of adc
(add with carry) after compare to branchlessly increment.
| Operation | Branchy | Branchless | Speedup |
| -------------- | ------- | ---------- | ------- |
| `abs(x)` | \~5ms | \~5ms | 1.00x |
| `clamp(x,m,M)` | \~6ms | \~6ms | 1.00x |
| `partition()` | \~6ms | \~5ms | 1.20x |
abs()
and clamp()
show negligible gains; branch prediction handles them well.partition()
shows improvement due to high branch unpredictability.See appendix below for full C benchmark code, compilation flags, and timing logic.
Warning: long code ahead. Skip if you value your retina and you have a life.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
// ABS branchy and branchless
int abs_branch(int x) {
return x < 0 ? -x : x;
}
int abs_branchless(int x) {
int mask = x >> 31;
return (x + mask) ^ mask;
}
// CLAMP branchy and branchless
int clamp(int x, int min, int max) {
if (x < min) return min;
if (x > max) return max;
return x;
}
int clamp_branchless(int x, int min, int max) {
int r1 = x - ((x - min) & ((x - min) >> 31));
return r1 - ((r1 - max) & ((r1 - max) >> 31));
}
// PARTITION branchy and branchless
void swap(int* a, int* b) {
int t = *a;
*a = *b;
*b = t;
}
int partition(int* arr, int low, int high) {
int pivot = arr[high];
int i = low;
for (int j = low; j < high; j++) {
if (arr[j] < pivot) {
swap(&arr[i], &arr[j]);
i++;
}
}
swap(&arr[i], &arr[high]);
return i;
}
int partition_branchless(int* arr, int low, int high) {
int pivot = arr[high];
int i = low;
for (int j = low; j < high; j++) {
swap(&arr[i], &arr[j]);
i += arr[i] < pivot;
}
swap(&arr[i], &arr[high]);
return i;
}
// Benchmarking helpers
void benchmark_abs(int* data, int count) {
clock_t start = clock();
volatile long long sum = 0;
for (int i = 0; i < count; ++i) sum += abs_branch(data[i]);
printf("ABS (branch): %.3f sec\n", (double)(clock() - start)/CLOCKS_PER_SEC);
start = clock(); sum = 0;
for (int i = 0; i < count; ++i) sum += abs_branchless(data[i]);
printf("ABS (branchless): %.3f sec\n", (double)(clock() - start)/CLOCKS_PER_SEC);
}
void benchmark_clamp(int* data, int count) {
clock_t start = clock();
volatile long long sum = 0;
for (int i = 0; i < count; ++i) sum += clamp(data[i], -50, 50);
printf("CLAMP (branch): %.3f sec\n", (double)(clock() - start)/CLOCKS_PER_SEC);
start = clock(); sum = 0;
for (int i = 0; i < count; ++i) sum += clamp_branchless(data[i], -50, 50);
printf("CLAMP (branchless): %.3f sec\n", (double)(clock() - start)/CLOCKS_PER_SEC);
}
void benchmark_partition(int* data, int count) {
int* copy = malloc(sizeof(int) * count);
memcpy(copy, data, sizeof(int) * count);
clock_t start = clock();
partition(data, 0, count - 1);
printf("PARTITION (branch): %.3f sec\n", (double)(clock() - start)/CLOCKS_PER_SEC);
memcpy(data, copy, sizeof(int) * count);
start = clock();
partition_branchless(data, 0, count - 1);
printf("PARTITION (branchless): %.3f sec\n", (double)(clock() - start)/CLOCKS_PER_SEC);
free(copy);
}
int main() {
const int N = 10000000;
int* data = malloc(sizeof(int) * N);
// Populate with mixed signed integers
for (int i = 0; i < N; ++i)
data[i] = rand() - (RAND_MAX / 2);
puts("== Benchmarking ABS ==");
benchmark_abs(data, N);
puts("\n== Benchmarking CLAMP ==");
benchmark_clamp(data, N);
puts("\n== Benchmarking PARTITION ==");
benchmark_partition(data, 1000000); // smaller size due to O(n log n) behavior
free(data);
return 0;
}
(If the scroll wheel starts smoking, you’ve found the end.)
Branchless programming is a scalpel, not a sledgehammer. Used wisely, it can make your code faster, safer, and cooler. Misused, it turns your logic into incomprehensible bit spaghetti.
"Premature optimization is the root of all evil — except when it's branchless, then it's performance art."