
Don't give Postgres too much memory (even on busy systems)

A couple of weeks ago I posted about how setting maintenance_work_mem too high may make things slower. That can be surprising, as the intuition is that more memory makes things faster. I got an e-mail about that post, asking if the conclusion would change on a busy system. That’s a really good question, so let’s look at it.

To paraphrase the message I got, it went something like this:

Lower maintenance_work_mem values may split the task into chunks that fit into the CPU cache. Which may end up being faster than with larger chunks.

However, what if the system is busy? Doesn’t the likelihood of the operation (e.g. CREATE INDEX) fitting into the CPU cache decrease significantly on a busy system? Couldn’t that change the behavior enough for larger maintenance_work_mem values to win?

Note: By “busy system” I understand a system with a non-trivial number of active processes, utilizing CPUs. Processes waiting on I/O won’t hit CPU caches very often, and thus won’t use a lot of it.

I think this is a perfectly reasonable question.

The CPU cache is a shared resource, and the active processes have to share it in some way. For example, if there are N active processes, we might assume each process gets roughly 1/N of the CPU cache - with ~300MB of cache and 32 busy backends, that would be less than 10MB per process.

We can make each process fit into this smaller budget (1/N of the cache), but that means more overhead later. For example, CREATE INDEX needs to sort the data, which it does with a mergesort. We can split the data into smaller chunks that fit into the cache, but then we’ll need to merge more chunks later.
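There’s even a simple way to watch this trade-off directly: the trace_sort developer option makes the sort code emit LOG messages about how much memory it used, whether it spilled to disk, and how the sorted runs get merged. A minimal sketch (the table, column and index names are made up):

# watch the sort behavior of an index build using the trace_sort developer option
# (some_table / some_column / some_idx are placeholder names)
psql test <<'EOF'
SET trace_sort = on;               -- emit LOG messages from the sort code
SET client_min_messages = log;     -- show those LOG messages in psql
SET maintenance_work_mem = '64MB';
CREATE INDEX some_idx ON some_table (some_column);
EOF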

The intuition is that at some point the “additional overhead” increases enough to negate the benefit of chunks fitting into CPU cache. Where is this crossover point?

I don’t know. But it’s easy enough to do some tests and see if it behaves in the expected way. If not, maybe our model of dividing CPU cache into 1/N pieces is too naive?

GIN

The original post was about building GIN indexes on OSM data. But there was always just one backend running the CREATE INDEX, possibly with parallel workers. Let’s make the system busier in two different ways:

First, we can run multiple CREATE INDEX commands concurrently, so that all backends use the CPU cache in exactly the same way (split the data into equally-sized chunks, and so on).

Second, we can run the CREATE INDEX, but also run some OLTP workload in the background. I chose a read-only pgbench with 32 clients. These OLTP backends use much less CPU cache.
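A minimal sketch of what one such run might look like (this is not the exact script used for the benchmarks - the OSM table, column and index names are placeholders):

# optional background OLTP load: read-only pgbench with 32 clients
pgbench -n -S -c 32 -j 32 -P 1 -T $((24*3600)) test &

# 8 concurrent CREATE INDEX sessions, each with a few parallel workers
# (osm_data / tags / osm_gin_idx_* are placeholder names)
for s in $(seq 1 8); do
  psql test -c "SET maintenance_work_mem = '64MB';
                SET max_parallel_maintenance_workers = 4;
                CREATE INDEX osm_gin_idx_$s ON osm_data USING gin (tags);" &
done
wait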

Note: This is running on the same system as the original post, i.e. on a D96 instance on Azure (D96dsv6), with Xeon Platinum 8573C (96 cores, ~300MB cache), 384GB RAM, NVMe 6x RAID0.

concurrent CREATE INDEX

Here are results without concurrent pgbench.

sessions | workers | 64MB | 1GB  | 1GB/64MB
---------|---------|------|------|---------
1        | 32      |  133 |  147 | 111%
1        | 64      |  147 |  154 | 104%
8        | 4       |  324 |  379 | 117%
8        | 8       |  260 |  287 | 110%
32       | 1       | 1120 | 1289 | 115%
64       | 1       | 1622 | 1762 | 109%

The “sessions” column says how many concurrent CREATE INDEX commands were running, and “workers” says the number of parallel workers per session. So for example 8 sessions and 4 workers means there were 32 backends (8 leaders + 24 workers) running the CREATE INDEX.

The 64MB and 1GB columns show timings for the CREATE INDEX (averaged over all the sessions), and the last column shows the 1GB timing relative to 64MB (values over 100% mean 1GB is slower).

And the conclusion is exactly the same as before - 1GB is measurably slower (~10%), even with a lot of concurrency.

concurrent CREATE INDEX + read-only pgbench

Now let’s try running a read-only pgbench in the background, with 32 clients (scale 5000, which fits into RAM, but not into shared buffers):

pgbench -n -S -c 32 -j 32 -P 1 -T $((24*3600)) test

The results look like this:

sessions | workers | 64MB | 1GB  | 1GB/64MB
---------|---------|------|------|---------
1        | 32      |  176 |  197 | 112%
1        | 64      |  190 |  203 | 107%
8        | 4       |  428 |  494 | 115%
8        | 8       |  337 |  378 | 112%
32       | 1       | 1499 | 1683 | 112%
64       | 1       | 2225 | 2305 | 104%

Yes, it’s slower than without the pgbench, but the relative timing between 64MB and 1GB is almost exactly the same. 64MB is still ~10% faster than 1GB.

B-TREE

While I was at it, I decided to test this with B-tree indexes too - what if the behavior depends on the index type? The tests are exactly the same as for GIN, except that they build a B-tree index on pgbench_accounts.

Or more precisely, they build the index on either a sorted or a randomized copy of the table, which also tends to affect the build time:

-- sequential/sorted data
CREATE TABLE sequential AS SELECT * FROM pgbench_accounts ORDER BY aid;

-- randomized data
CREATE TABLE random AS SELECT * FROM pgbench_accounts ORDER BY random();

Similarly to GIN, the indexes are built with different numbers of sessions, workers and maintenance_work_mem values, with and without a concurrent pgbench running in the background.
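A single run might look something like this sketch (indexing the aid column is my assumption, and the index name is made up):

# one B-tree build on the randomized copy of pgbench_accounts
# (the aid column and the index name are assumptions)
psql test -c "SET maintenance_work_mem = '1GB';
              SET max_parallel_maintenance_workers = 8;
              CREATE INDEX random_aid_idx ON random (aid);"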

Note: This was done on a different D96v6 instance, with AMD EPYC 9V74 CPU (96 cores, ~200MB cache), 384GB RAM, 6x NVMe RAID0. I don’t think it makes the behavior significantly different.

concurrent CREATE INDEX

Results without the concurrent pgbench (so just the CREATE INDEX):

sessions | workers | dataset    | 64MB | 1GB | 1GB/64MB
---------|---------|------------|------|-----|---------
1        | 1       | random     |  237 | 211 | 89%
1        | 1       | sequential |  140 | 119 | 85%
1        | 8       | random     |   82 |  81 | 98%
1        | 8       | sequential |   66 |  62 | 95%
8        | 1       | random     |  287 | 256 | 89%
8        | 1       | sequential |  193 | 171 | 89%
8        | 8       | random     |  143 | 137 | 96%
8        | 8       | sequential |  124 | 119 | 96%

It seems that for B-tree there is a small benefit to using the 1GB memory limit, which in some cases results in a 5-10% speedup. That’s nice, although I’m not sure it’s worth using 16x more memory.

In any case, this seems to be largely unaffected by concurrency. If you compare the effect for 1 and 8 sessions, the 64MB vs. 1GB comparisons look almost exactly the same.

Note: I did not test this with more than 8 sessions, because that makes the test significantly I/O bound. Each index is ~10GB, so with 32 sessions that’s 320GB, not including the temporary files for the sort. That is too much even for the NVMe RAID, which makes the differences due to maintenance_work_mem mostly negligible. The GIN builds don’t have this issue (at least not to the same extent), as they take more time (spreading out the writes) and the indexes are smaller.
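(For reference, index sizes are easy to check with the standard size functions; the index name below is the made-up one from the earlier sketch.)

# check the on-disk size of a finished index
psql test -c "SELECT pg_size_pretty(pg_relation_size('random_aid_idx'));"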

concurrent CREATE INDEX + read-only pgbench

And results with the concurrent pgbench (again read-only, with 32 clients, just like with GIN indexes):

sessions | workers | dataset    | 64MB | 1GB | 1GB/64MB
---------|---------|------------|------|-----|---------
1        | 1       | random     |  249 | 220 | 89%
1        | 1       | sequential |  147 | 125 | 85%
1        | 8       | random     |   87 |  85 | 97%
1        | 8       | sequential |   68 |  66 | 96%
8        | 1       | random     |  317 | 279 | 88%
8        | 1       | sequential |  211 | 186 | 88%
8        | 8       | random     |  167 | 164 | 99%
8        | 8       | sequential |  146 | 141 | 97%

There’s virtually no difference compared to results without concurrent pgbench (durations increased, but the relative differences between 64MB and 1GB are pretty much the same).

What does this mean?

What do these results tell us? My interpretation is that the data contradict the hypothesis that on busy systems the per-process fraction of the CPU cache becomes too small for the small chunks to pay off, and that higher maintenance_work_mem values would therefore win.

With GIN indexes, increasing maintenance_work_mem never helps - 64MB is always faster than 1GB. Maybe it’d be even faster with 32MB? Could be. At some point it’ll get slower, as it’ll make the temporary files larger and the workers less efficient.

Builds with 1GB are consistently ~10% slower than builds with 64MB. It does not matter how many concurrent CREATE INDEX sessions or workers there are, or whether a read-only pgbench is running in the background. In particular, there is no “inversion” where 1GB would get faster than 64MB.

With B-tree indexes it’s a bit less clear. Increasing maintenance_work_mem does help a little bit, although 10% speedup for 16x more memory seems a bit underwhelming. But even here the results do not change at all depending on how busy the system is. It does not matter if there are 1 or 8 sessions, with/without pgbench.

Why?

Why doesn’t this behave in the expected way? I’m not entirely sure, but I suspect it means the model where each process gets 1/N of the CPU cache is too naive, for a couple reasons.

First, processes may use the cache very differently. The CREATE INDEX processes accumulate a fair amount of data (determined by maintenance_work_mem), and then perform a sort, which benefits from having the data in cache. OTOH the OLTP queries from pgbench process only a tiny amount of data, and don’t visit it many times.

So some processes are perfectly fine with much less data in the CPU cache, which means other processes can get a much larger fraction. Of course, all of this is managed automatically by the CPU.

Second, the processes don’t use the CPU cache the same way over time. The CREATE INDEX does not just do the sorting (which is cache-intensive). It first needs to read the data into memory, write the sorted data into temporary files, and so on.

Let’s say the process spends only 1/3 of its time in the cache-intensive code (i.e. the actual sort). That means other processes can use more of the cache for the remaining 2/3 of the time. (This is somewhat imprecise, as the “time” depends on cache hits/misses. Maybe talking about instructions would be better?)

It’s entirely possible (and even probable) there are other / additional reasons explaining this behavior. I’d love to hear about them!

Conclusion

My takeaway is that the behavior does not change very much even on busy systems. The relative performance of 64MB and 1GB maintenance_work_mem remains about the same.

Note: There are other reasons to prefer lower values on busy systems. Many processes allocating a lot of memory may push other useful data from memory / page cache, and perhaps even cause OOM. But that’s not what this post is about.

I should probably re-read What Every Programmer Should Know About Memory once again, particularly the part about CPU caches. I’d bet it explains at least some of the behavior discussed in this post.

Another thing that just occurred to me is that maybe perf stat could help with this. Surely there’s a way to measure (or estimate) how much cache a process uses, how much it would need, and so on. Unfortunately, the Azure VMs have very limited PMU support, so I couldn’t do this.
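For completeness, this is roughly the kind of measurement I had in mind (a sketch; on these VMs most hardware counters are unavailable, so the events would likely show up as not supported):

# sample cache-related counters for a single backend for 60 seconds
# (replace <backend-pid> with the PID of the CREATE INDEX backend)
perf stat -e cache-references,cache-misses,LLC-loads,LLC-load-misses \
          -p <backend-pid> -- sleep 60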

Do you have feedback on this post? Please reach out by e-mail to tomas@vondra.me.