Tomas Vondra


Stabilizing Benchmarks

I do a fair amount of benchmarking as part of development, both on my own patches and while reviewing patches by others. That often requires dealing with noise, particularly for small optimizations. Here’s an overview of the methods I use to filter out random variations / noise.

Most of the time it’s easy - the benefits are large and obvious. Great! But sometimes we need to care about cases when the changes are small (think less than 5%).

For example, an optimization may not apply to some cases, or maybe it’s not an optimization at all. In those cases we want to demonstrate the patch does not cause regressions. By definition the differences will be small, and we have to be careful about noise. We need to filter it out of the results somehow.

I had to do a lot of this recently while working on index prefetching. I learned about a bunch of new sources of noise, and also ways to stabilize the results. I figured it might be helpful to try to organize this a little bit, both for myself and other developers.

Multiple long runs, warmup

The traditional way to stabilize results and filter out noise is to run the benchmark multiple times, and make the runs sufficiently long. This is benchmarking 101, so I’ll discuss this only very briefly.

How many and how long runs are needed? That depends on the benchmark and other things (like hardware). For a read-write pgbench, you might do 5 runs, each long enough to cover 2-3 checkpoints. Yes, that means the whole benchmark may run for a couple hours. It takes time to run a good benchmark.

For other benchmarks it may be shorter. A read-only pgbench has nothing to do with checkpoints, so runs can be much shorter (a minute, maybe?).

You’ll still need multiple runs. The goal is to see how different the results are between runs. There are statistical models that tell you how many runs you need for a desired “confidence level” (assuming the results follow e.g. normal distribution).

I usually don’t bother with that, though. I simply look at how much the results vary between runs. If the differences are significant, it means I need to stabilize them somehow. Say, by making the runs longer.

Or maybe the system needs a longer warmup, to load data into cache and so on. The required length depends on the amount of data, hardware, and so on. I sometimes run pgbench -P 1 for a while, and when the throughput stops growing it means the warmup is done.

Here’s a chart with throughput during a read-only pgbench:

It starts at ~25k tps, and it takes about 200 seconds to stabilize at ~100k tps. This implies a warmup should be at least 200 seconds.
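
A minimal driver script might look like this (just a sketch - the client counts, durations and the “bench” database name are placeholders to adjust for your hardware and workload):

# warm up the caches first, watching the per-second progress (-P 1) until it levels off
pgbench -S -c 16 -j 16 -P 1 -T 300 bench

# then do several longer timed runs, keeping the output of each run
for run in 1 2 3 4 5; do
    pgbench -S -c 16 -j 16 -T 600 bench > run-$run.log
done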

It’s a good idea to track how much the runs vary, and take that into consideration when comparing results. Consider results for two tests, with throughputs of 1000 tps and 1100 tps. If runs vary by up to ~200 tps, does that really say the second test is faster? Or is it just luck? If the runs vary by ~20 tps, it’s much clearer.

Are the results still too noisy, even with a long warmup and multiple long runs? There are a couple of lower-level tools for stabilizing results that I use. I’ll discuss them in an order I find logical, starting with the simpler ones.

Do you need to apply all of them? Certainly not. It’s more of a toolbox, and not every tool is useful for every benchmark. It’s up to you how far you choose to go.

Cold runs / dropping caches

I often need to test “cold” runs, as if no data were in memory (either in shared buffers or the page cache). Without this, some runs might hit data already in cache, making them unexpectedly faster. That’s the random noise we want to eliminate.

The cleanest way would be to restart the server, but that takes quite a bit of time, so it’s not very practical.

For shared buffers this can be done by restarting the instance, or (since PG 18) using the function pg_buffercache_evict_relation.
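
For example (a sketch - it assumes PostgreSQL 18, the pg_buffercache extension, and a pgbench database called “bench”):

$ psql bench -c "CREATE EXTENSION IF NOT EXISTS pg_buffercache"
$ psql bench -c "SELECT * FROM pg_buffercache_evict_relation('pgbench_accounts')"

Note that indexes are separate relations, so they have to be evicted on their own.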

The page cache (in Linux) contents can be dropped by writing into the proc filesystem:

# echo 3 > /proc/sys/vm/drop_caches

This drops both the page cache and slab objects (dentries + inodes).

It’s not perfect. Some data may be cached in the storage device itself (e.g. SSDs often have a sizeable built-in DRAM cache). I wonder if there might be a way to drop that, but it’s probably device-specific.

MAP_POPULATE

A common issue is that the shared memory is not fully initialized when starting the instance. The kernel reserves the memory, but only a few pages are actually touched. That means the first runs after a restart incur a lot of page faults while the page tables get populated. That can be fairly expensive, and it’s a source of noise.

The easiest way to address this is to add MAP_POPULATE to the mmap calls in src/backend/port/sysv_shmem.c. I have a small patch for that, but maybe we should add this as a developer GUC.

Huge Pages

A related topic is using huge pages for shared memory, by setting

huge_pages = on

and making sure the kernel has enough huge pages reserved:

# sysctl -w vm.nr_hugepages=10000

This does not really remove run-to-run differences / noise, because all runs either use huge pages or don’t. But huge pages reduce page table management overhead and TLB pressure. That means more time is spent in the benchmarked code, making changes in it easier to measure.
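
To size the reservation, newer releases (PG 15+) can report how many huge pages the instance needs. This has to be run while the server is stopped:

$ postgres -D $PGDATA -C shared_memory_size_in_huge_pages

Reserve at least that many pages with the sysctl above, and check HugePages_Total / HugePages_Free in /proc/meminfo after starting the instance to confirm they are actually used.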

… but not Transparent Huge Pages

Explicit huge pages are helpful, but Transparent Huge Pages (THP) are a very different thing. The idea of THP is that the kernel tries to group regular memory pages allocated by an application, and replace them with huge pages. This is meant to help applications that don’t explicitly request huge pages.

Unfortunately, the effects are rather unpredictable. A lot of this happens in the background at arbitrary times, and there is a “defrag” process trying to combine memory pages. This adds noise and/or stalls. There’s a reason why docs for various databases recommend disabling THP.

THP can be disabled through /sys:

# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/defrag
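
You can check the current state by reading the same files - the active value is the one in brackets:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]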

Huge Pages for binary (tmpfs)

One surprising thing I learned only recently is that huge pages matter even for the binary, not just for “data”. And that it’s unpredictable, depending on which pages get used for the page cache. I’ve been told:

Kinda unfortunately Linux has learned to sometimes put page cache pages into huge pages in recent-ish kernels. But it happens fairly randomly and it often takes a while to happen. So you e.g. can end up with an older binary being in huge pages, but more recently built stuff not yet.

If the “old” binary is the baseline/master, and the “new” binary is the patched build we’re evaluating, this is not great. I don’t even know if there’s a way to check if a binary is backed by huge pages.

The only reason why we noticed is that we saw ~10% difference between two builds that we couldn’t explain. We struggled to figure out what’s causing it, and then noticed very different values for iTLB loads/misses in perf-stat.
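
If you suspect something like this, perf-stat on a busy backend is a quick way to check (a sketch - the exact event names vary between CPUs, and <backend_pid> is whatever backend runs the benchmark queries):

$ perf stat -e iTLB-loads,iTLB-load-misses -p <backend_pid> -- sleep 10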

There is a post from 2022 about this, demonstrating the effects. Unfortunately, the proposed changes (MADV_COLLAPSE and mremap) are not done yet.

However, there’s a way to make sure a binary uses huge pages. You can mount a tmpfs filesystem backed by huge pages, copy the build onto that filesystem, and then use the copy.

$ sudo mount -t tmpfs -o huge=always,size=2G tmpfs_pg ~/tmpfs-hp/
$ cp -Rv builds/pg-master ~/tmpfs-hp/
$ ~/tmpfs-hp/pg-master/bin/pg_ctl -D ...

This seems to do the trick for me, and until we get some patch forcing binaries into huge pages it’s the only way.

LDFLAGS="-fuse-ld=mold -Wl,--shuffle-sections"

Speaking of binaries, the binary layout matters too. A couple months ago I wrote about BOLT, focusing on optimizing the binary layout. And for some workloads the gains were pretty massive.

The post also mentions that a code change may result in a build with random changes in the binary layout. That results in small but measurable performance differences - perhaps 1-5%, which is a lot of noise.

Stabilizer was meant to eliminate this noise by randomizing the binary layout, but unfortunately the project seems dead.

However, there might be a way to achieve something similar with mold, using the --shuffle-sections option:

--shuffle-sections, --shuffle-sections=number: Randomize the output by shuffling the order of input sections before assigning them the offsets in the output file. If a number is given, it’s used as a seed for the random number generator, so that the linker produces the same output for the same seed. If no seed is given, a random number is used as a seed.

This option is useful for benchmarking. Modern CPUs are sensitive to a program’s memory layout. A seemingly benign change in program layout, such as a small size increase of a function in the middle of a program, can affect the program’s performance. Therefore, even if you write new code and get a good benchmark result, it is hard to say whether the new code improves the program’s performance; it is possible that the new memory layout happens to perform better.

By running a benchmark multiple times with randomized memory layouts using --shuffle-sections, you can isolate your program’s real performance number from the randomness caused by memory layout changes.

You can build with different seeds to change the binary layout, and then average (or take the median of) results from runs on these builds. That should isolate the influence of the binary layout. It’s not as convenient as Stabilizer (which was tweaking the layout automatically in the background), but the end result seems similar.
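
A sketch of how such builds might be produced (the prefix paths and seed values are just placeholders):

for seed in 1 2 3 4 5; do
    ./configure --prefix=$HOME/builds/pg-shuffle-$seed \
                LDFLAGS="-fuse-ld=mold -Wl,--shuffle-sections=$seed"
    make -s clean && make -s -j$(nproc) install
done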

Here is a chart with results of read-only pgbench from 10 builds with different mold shuffle seeds (note the y-axis does not start at 0):

There are 10 runs for each build (blue points), and a per-build average (red points). The differences between the builds are pretty clear. There is a bit of variance within each build (about 1%), while the difference between the fastest and slowest build is close to 5%.

These differences are much smaller than the improvements I’ve seen with BOLT. The goal here is to filter out noise caused by binary layout changes (so that we can benchmark some other code). We’re not trying to optimize the layout, which is what BOLT does.

numactl --physcpubind

The kernel scheduler is another source of noise. The scheduler has to map processes to CPUs, move them around, etc. When you run a script using pgbench, each connection has two sides: a client process/thread, and a backend handling the server side of the connection. The scheduler may assign both processes to the same CPU, to different CPUs on the same socket, or to CPUs on different sockets. NUMA adds yet another dimension.

Each option has (slightly) different performance. It shouldn’t be huge, but it’s large enough to matter for benchmarking. The scheduler may also decide to migrate the processes to different cores, which has a cost too. All of this adds a bit of noise and run-to-run variation.

One way to deal with this is to use numactl to pin processes to a particular CPU (or CPUs). For example, to pin pgbench to CPU 1, you can do this:

$ numactl --physcpubind=1 pgbench ...

Similarly, you can use numactl when starting Postgres itself, which pins all child processes to the selected CPU(s). That may or may not be what you want, because then all backends (and other processes) will be restricted to those CPUs.
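
For example, to restrict the whole instance to the first eight CPUs (a sketch - pick CPUs that make sense on your machine):

$ numactl --physcpubind=0-7 pg_ctl -D $PGDATA -l pg.log start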

Scheduler

A couple months ago I posted about a weird behavior I’m seeing with a certain workload. As part of that investigation (still ongoing) I’ve developed a pgbench patch that allows pinning the processes in certain ways:

  • none - leave it to the kernel scheduler (default)
  • colocated - both sides of a connection on the same CPU (core)
  • random - each side of a connection on a different CPU (core)

See this for more details about the rationale of these choices.

I find the --pin-cpus colocated option useful for microbenchmarks, as it minimizes the scheduling / migration overhead.

CPU idle states

Speaking of the weird regression I posted about in June, that seems to be related to the CPU entering an idle state too often. While discussing it with other engineers, I’ve heard suggestions that this issue may not be all that rare. It could be happening quite a bit, but few people notice it, as it’s quite hard to identify.

There is a simple way to avoid this issue: force the kernel to do a “busy wait” while idle, instead of halting and entering a sleep state. All you need to do is add

idle=poll

to the kernel command line.
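
On a GRUB-based system that means appending idle=poll to GRUB_CMDLINE_LINUX in /etc/default/grub, regenerating the config (update-grub on Debian/Ubuntu, grub2-mkconfig elsewhere) and rebooting:

# update-grub
# reboot

After the reboot, /proc/cmdline should contain idle=poll.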

Don’t do this except when benchmarking, as it has some negative effects too. First, it effectively disables power management (so the machine will use much more power). Second, it prevents Turbo Boost, as that requires most of the cores to be idle (and with idle=poll all cores will seem busy). It does not seem like a good idea for production, certainly not by default.

Maximize impact of the change

If you look at the suggestions so far, you may notice they mostly fall into one of two categories. Some reduce the amount of noise, say by removing the source (e.g. using MAP_POPULATE or disabling THP).

Other suggestions eliminate a type of overhead. For example, huge pages reduce TLB pressure. That means the benchmarked code accounts for a larger fraction of the total time, and is thus easier to measure.

I find this to be a very useful principle - cut the benchmark down to the very core, by eliminating unnecessary parts. That removes a lot of the noise associated with the pieces dropped from the test, and the benchmarked code then represents a larger fraction of the duration, which makes it easier to quantify the difference. A 10% improvement in code that accounts for 50% of the time is easier to spot than a 10% improvement in code that accounts for 5%.

In short: remove unnecessary parts of the benchmark, and leave only the code you actually want to measure.

I neglected this principle when benchmarking the index prefetching a couple days ago. The test was setting a couple GUCs to force a plan, and a couple other parameters. It did that simply by adding a couple SET commands to the pgbench script.

A SET seems cheap, but when a query takes a fraction of a millisecond it’s actually fairly expensive. I only noticed that because one of the tests had an extra SET, and it appeared ~25% slower for no reason.

The solution was simple - don’t do any SET commands in the script. Do all of that using ALTER SYSTEM. The overhead is gone, and we benchmark just the code we actually want to measure.
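
For example, instead of a SET enable_seqscan = off at the top of the pgbench script (enable_seqscan is just a stand-in for whatever GUCs the test needs), set it once for the whole instance before the benchmark:

$ psql -c "ALTER SYSTEM SET enable_seqscan = off"    # example GUC only
$ psql -c "SELECT pg_reload_conf()"

Afterwards, reset it with ALTER SYSTEM RESET enable_seqscan.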

Similarly, it’s preferable when the test does not produce any results. Formatting data and sending it over the network (even on the same host) is quite expensive. The best option I found is to use OFFSET to skip all the actual results:

SELECT * FROM (... actual query ... OFFSET 1000000000);

This also skips de-TOASTing of the result values, which seems like a good thing - unless you’re trying to benchmark the de-TOASTing, of course.

I’ve seen tests using EXPLAIN ANALYZE to get timing without producing a lot of results (and I’ve used this approach for a while too). But it’s not a good approach, because the EXPLAIN ANALYZE instrumentation may add quite a bit of overhead (particularly with TIMING ON). That directly contradicts the principle, and also adds noise.

Conclusion

There is some amount of futility in this, because changes in other parts of the system can affect the outcome. Using a different CPU (maker, generation or even just a model) can change the behavior a lot. Using a different compiler (perhaps just a different version) can have similar effects too. Even two separate builds (with different binary layout) can behave very differently.

Sometimes it may seem we obsess about tiny regressions too much. If some cases get 1% slower, while other cases get 50% faster, is that a bad trade-off? Probably not, assuming the improved cases are likely/common. But to evaluate these trade-offs, you need to know the pros/cons first. And that requires measurements.

Do you have feedback on this post? Please reach out by e-mail to tomas@vondra.me.