Tuning the glibc memory allocator (for Postgres)
If you’ve done any Postgres development in C, you’re probably aware of the concept of memory contexts. The primary purpose of memory contexts is to absolve the developers of having to track every single piece of memory they allocated. But it’s about performance too, because memory contexts cache the memory to save on malloc/free calls. But malloc gets the memory from another allocator in libc, and each libc does its own thing. The glibc allocator has some concurrency bottlenecks (which I learned the hard way), but it’s possible to tune that.
Memory Context caching
If you ever wrote anything in C, you know how tedious it is to deal with memory. Not only do you have to allocate the memory (using malloc), you also have to track all the allocated pieces and free them at just the right moment. If you free them too early, you get use-after-free errors. If you free them too late, you get memory leaks (or at least memory usage much higher than necessary).
This is why Postgres relies on something we call “memory context.” The primary purpose of a memory context is to group allocations. You can imagine the context is keeping a list of pointers returned by malloc() (it’s much more elaborate in practice). This means that when we’re done executing a part of a query, we can simply do MemoryContextDelete and that automatically frees all the memory associated with the context. Which is super convenient.
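To make that concrete, here’s a minimal sketch of how backend code typically uses this. The context name and the loop are made up for illustration; the functions (AllocSetContextCreate, MemoryContextSwitchTo, palloc, MemoryContextDelete) are the standard memory context API:

#include "postgres.h"
#include "utils/memutils.h"

static void
do_some_work(void)
{
    /* create a private context as a child of the current one */
    MemoryContext mycxt = AllocSetContextCreate(CurrentMemoryContext,
                                                "my work context",
                                                ALLOCSET_DEFAULT_SIZES);
    MemoryContext oldcxt = MemoryContextSwitchTo(mycxt);

    for (int i = 0; i < 1000; i++)
    {
        char *buf = palloc(128);    /* allocated in mycxt */

        /* ... do something with buf, no pfree needed ... */
        (void) buf;
    }

    MemoryContextSwitchTo(oldcxt);
    MemoryContextDelete(mycxt);     /* frees all 1000 allocations at once */
}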
Note: We often use the terms “context” and “allocator” as if they were synonyms, but it’s not quite correct I think. An allocator is responsible only for allocating and freeing pieces of memory, nothing more. Context is expected to do more than that - it logically “groups” the allocated memory chunks, allows freeing all the memory at once, etc. Of course, the allocator may need to do some of this internally too, but it’s not exposed to users.
I don’t want to get into the details of memory contexts - the hierarchy of contexts, the specialized implementations we have, etc. There’s a README with plenty of details, and there are also comments in each variant.
One feature I need to mention is caching of freed memory, because it’s very relevant for this post.
Note: For the rest of this post I’m going to assume the memory context is AllocSet. This is the oldest, most universal, and most common memory context implementation. Most of this equally applies to the other contexts.
Imagine you do this:
char *ptr;
ptr = palloc(128);
ptr = palloc(128);
The palloc/pfree are our malloc/free wrappers that work with the “current” memory context. The first palloc will call malloc to get some memory from the OS - we don’t malloc just the 128B, but a larger chunk of memory - could be 1kB, for example. Then we return a pointer to the first 128B, and remember the 128B are used.
Then the second palloc arrives, asking for another 128B. But this time we don’t need to call malloc at all. We still have ~900B available from the last malloc, so we just hand out a pointer, and update the accounting of how much free space we still have. You could repeat the palloc call about 6 more times until the next malloc is needed.
This massively reduces the number of malloc calls. It’s not uncommon to see malloc allocating 1MB+ blocks that can handle thousands of palloc allocations. This matters, because the malloc calls can be quite expensive, as we’ll see later.
Consider this slight variation of the earlier example:
char *ptr;
ptr = palloc(128);
pfree(ptr);
ptr = palloc(128);
This is even less likely to call malloc, because the pfree does not actually call free. Instead, it puts the pointer onto a freelist (along with information about how large it is), and uses it for the next palloc call. You could even put the pfree+palloc into a loop, and it would never call malloc.
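The real AllocSet code is much more elaborate (freelists per size class, chunk headers, block management), but a toy standalone version of the two tricks - carving chunks from a big malloc’d block, and recycling freed chunks through a freelist - might look like this. Illustration only, not Postgres code:

#include <stdlib.h>
#include <string.h>

/* Toy fixed-size allocator: one malloc'd block, plus a freelist of
 * recycled 128B chunks. Not the real AllocSet, just the idea. */
#define CHUNK_SIZE  128
#define BLOCK_SIZE  1024

static char  *block;            /* current block obtained from malloc() */
static size_t block_used;       /* bytes already handed out from it */
static void  *freelist;         /* chain of recycled chunks */

static void *
toy_alloc(void)
{
    /* 1. prefer a recycled chunk - no malloc() needed */
    if (freelist)
    {
        void *chunk = freelist;

        memcpy(&freelist, chunk, sizeof(void *));   /* pop from the list */
        return chunk;
    }

    /* 2. otherwise carve from the current block, getting a new one if needed */
    if (block == NULL || block_used + CHUNK_SIZE > BLOCK_SIZE)
    {
        block = malloc(BLOCK_SIZE);     /* the only malloc() call here */
        block_used = 0;
    }
    block_used += CHUNK_SIZE;
    return block + block_used - CHUNK_SIZE;
}

static void
toy_free(void *chunk)
{
    /* don't call free(), just push the chunk onto the freelist */
    memcpy(chunk, &freelist, sizeof(void *));
    freelist = chunk;
}

A toy_free() + toy_alloc() loop never reaches malloc(), which is exactly what the pfree+palloc loop above does inside a memory context.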
This caching is very important, because malloc can be surprisingly costly. As usual, it does depend on a lot of things. Firstly, malloc is not a syscall but a function provided by libc. And there are many libc implementations, each with their own malloc variant.
The Linux glibc seems to be fairly conservative, using a variant of ptmalloc, derived from the original dlmalloc allocator. FreeBSD and NetBSD switched to a completely new allocator called jemalloc, focused on SMP systems, concurrency, etc. Other operating systems do their own thing too, of course.
Each of those malloc implementations is meant to allocate memory, but the designs are very different, putting emphasis on different tradeoffs. Allocators designed for single-core systems will look very different from allocators designed for systems with many cores.
This blog post is about the glibc allocator, which is what nearly all Linux systems use - and thus likely what most people run Postgres on.
Limits of caching
But if we have caching built into our memory contexts, why should it matter that malloc can be expensive? Turns out there are cases when the caching can’t help.
The caching happens within a given memory context, so once that context gets destroyed, the cache disappears. This typically happens at the end of a query - we create a “per query” context, it handles all allocations for that query (many of them from the cache). Then the query completes, and we call MemoryContextDelete on all the contexts, which returns the memory back to libc. And on the next query we start from scratch, with empty caches.
We have little control over what the libc allocator does. It might have kept the memory in its own cache (so the malloc will be cheap). Or maybe it eagerly returned the memory to the OS (malloc will need a syscall, and may be expensive).
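If you’re curious which of the two happened, glibc can tell you - mallinfo2() (available since glibc 2.33) reports how much memory the allocator currently holds. A small standalone experiment; the sizes are arbitrary, and the exact numbers will vary by glibc version and settings:

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>     /* mallinfo2, glibc 2.33+ */

static void
report(const char *label)
{
    struct mallinfo2 mi = mallinfo2();

    printf("%-10s arena=%zu bytes, free in arena=%zu bytes\n",
           label, mi.arena, mi.fordblks);
}

int
main(void)
{
    char *chunks[1000];

    report("start");

    /* allocate ~27MB in 27kB pieces */
    for (int i = 0; i < 1000; i++)
        chunks[i] = malloc(27 * 1024);
    report("allocated");

    /* free everything - glibc may keep it, or give it back to the kernel */
    for (int i = 0; i < 1000; i++)
        free(chunks[i]);
    report("freed");

    return 0;
}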
Another case is about “oversized chunks”. I said that pfree does not actually do free on the chunk, but stashes it into a freelist. But we only do that for “small” chunks, less than 8kB. The basic reasoning is that we don’t want too many freelists, we expect smaller chunks to be reused more often, and we don’t want to risk keeping large chunks unnecessarily.
Which means that if you request a chunk larger than 8kB, it will always do a malloc.
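In the source this is the allocChunkLimit check near the top of AllocSetAlloc in src/backend/utils/mmgr/aset.c. Roughly (heavily simplified, not the literal code):

/* simplified sketch of the AllocSetAlloc() logic in aset.c */
void *
AllocSetAlloc(MemoryContext context, Size size, int flags)
{
    AllocSet    set = (AllocSet) context;

    /*
     * Oversized chunks bypass the freelists entirely: each one gets its
     * own malloc'd block, and pfree() returns it to libc immediately.
     */
    if (size > set->allocChunkLimit)
        return AllocSetAllocLarge(context, size, flags);

    /* ... normal path: check the freelist, else carve from the current block ... */
}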
When caching does not work
But how big is the difference when the caching doesn’t work? Let’s do a quick test, on a table with many partitions. Initialize a small pgbench database (scale 1 is just 15MB) with 1000 partitions:
$ pgbench -i -s 1 --partitions=1000 test
and modify the schema by adding an index on columns that are not in the partition key. We just create a copy of the aid column because it’s simple, but it could be an arbitrary column. The point is that conditions on this column do not allow partition elimination.
ALTER TABLE pgbench_accounts ADD COLUMN aid_copy INT;
UPDATE pgbench_accounts SET aid_copy = aid;
CREATE INDEX ON pgbench_accounts(aid_copy);
VACUUM FREEZE;
VACUUM ANALYZE;
Great, now let’s run this pgbench script:
\set aid random(1, 100000 * :scale)
SELECT * FROM pgbench_accounts pa
JOIN pgbench_branches pb ON (pa.bid = pb.bid)
WHERE pa.aid_copy = :aid;
On my machine, this does only about 100 tps on a single client, when for a non-partitioned table it can easily do 35000 tps. For 16 clients, the difference is 700 tps vs. 340k tps. Sure, some of this is due to having to deal with 1000x more objects during planning, it’s the partitioning tax. But a factor of ~400 is maybe a bit too much?
Anyway, let’s do a bit of profiling using perf. Run the test with a single client, and collect a profile like this:
$ perf record -g -a --call-graph dwarf -- sleep 10
$ perf report > report.txt
And then do the same thing for 16 clients - be warned that this can collect quite a bit of data; on my machine the perf.data is ~5GB.
Anyway, if you look into the profile for a single client, you should see something like this (you’ll need to scroll a bit):
16.40% 0.18% postgres postgres [.] AllocSetAllocLarge
|
--16.22%--AllocSetAllocLarge
|
--16.06%--__GI___libc_malloc (inlined)
|
--16.03%--_int_malloc
|
|--7.03%--asm_exc_page_fault
| |
| --6.74%--exc_page_fault
| |
| --6.60%--do_user_addr_fault
| |
| --...
|
|
|--5.60%--sysmalloc
| |
| |--2.64%--__glibc_morecore (inlined)
| | |
| | --2.63%--__GI___sbrk (inlined)
| | |
| | --2.63%--__brk
| | |
| | --1.98%--...
| |
| |--1.97%--asm_exc_page_fault
| | |
| | --1.91%--...
Well, that’s quite a bit of time spent on malloc! In the profile for 16 clients it should be similar - I get values that are a bit lower, most likely because this is on-CPU time (so excludes lock waits etc.).
Why is this happening?
If you look at the explain for the query, you’ll get this:
QUERY PLAN
-----------------------------------------------------------------------------------------
Nested Loop
Join Filter: (pb.bid = pa.bid)
-> Seq Scan on pgbench_branches pb
-> Append
-> Index Scan using pgbench_accounts_1_aid_copy_idx on pgbench_accounts_1 pa_1
Index Cond: (aid_copy = 11)
-> Index Scan using pgbench_accounts_2_aid_copy_idx on pgbench_accounts_2 pa_2
Index Cond: (aid_copy = 11)
-> Index Scan using pgbench_accounts_3_aid_copy_idx on pgbench_accounts_3 pa_3
Index Cond: (aid_copy = 11)
-> Index Scan using pgbench_accounts_4_aid_copy_idx on pgbench_accounts_4 pa_4
Index Cond: (aid_copy = 11)
...
(2004 rows)
There are 1000 index scans, because the condition is on a column that’s not in the partition key, and thus we can’t prune any partitions. We have to initialize each index scan, and btbeginscan does this:
IndexScanDesc
btbeginscan(Relation rel, int nkeys, int norderbys)
{
...
/* allocate private workspace */
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
...
Unfortunately, BTScanOpaqueData is pretty large - about 27kB, so it qualifies as an oversized chunk, and thus won’t be cached. Each query will initialize 1000 index scans, each doing a malloc+free, and clearly it can be quite expensive in glibc.
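Where do the ~27kB come from? BTScanOpaqueData embeds two BTScanPosData structs, and each of those ends with a fixed-size array big enough to remember every TID a btree page could possibly return. Simplified from src/include/access/nbtree.h (most fields omitted):

typedef struct BTScanPosItem    /* what we remember about each match */
{
    ItemPointerData heapTid;
    OffsetNumber indexOffset;
    LocationIndex tupleOffset;
} BTScanPosItem;

typedef struct BTScanPosData
{
    /* ... buffer, page LSN, moreLeft/moreRight flags, ... */
    BTScanPosItem items[MaxTIDsPerBTreePage];   /* well over a thousand entries for 8kB pages */
} BTScanPosData;

typedef struct BTScanOpaqueData
{
    /* ... scan keys, array keys, kill-list, ... */
    BTScanPosData currPos;      /* roughly 13kB */
    BTScanPosData markPos;      /* another 13kB */
} BTScanOpaqueData;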
Tuning glibc allocator
Now we know what the problem is, but what can we do about it? Obviously, the ideal option would be to not do the allocation. If you can skip something, that’s the best optimization of all. Sadly, it’s not clear if we can skip these allocations easily.
Maybe we could allocate just the “main” part of the BTScanOpaque struct, which is only about 100B (and thus could be cached)? And then allocate the “item” arrays opportunistically, if actually needed (which for this query would be very rare).
Note: I haven’t tried that, but it might be an interesting way to learn about the btree code, and perhaps even a nice first patch. The hard part will be to decide if this is a good idea overall, not just for one testing query.
Another option might be to allow caching for these chunks. For example, we might raise the limit from 8kB to 32kB. But that’s risky, we might just increase memory usage with no benefit. Perhaps it would be better to add a new caching layer between libc and our memory contexts, and track these chunks there. We’d still risk wasting the memory, but this cache would be shared by all memory contexts, possibly even between query executions. I actually wrote an experimental version of this, see the memory pool patch in this thread for details.
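I won’t reproduce the patch here, but the general shape of the idea is simple: a small process-wide cache of large blocks sitting between the memory contexts and libc. A toy sketch of the concept (not the actual patch - that one has size classes, limits on how much memory may be cached, accounting, etc.):

#include <stdlib.h>

/* Toy process-wide cache of large blocks between contexts and libc.
 * Concept sketch only. */
#define POOL_SLOTS      8
#define POOL_BLOCK_SIZE (32 * 1024)     /* cache one "large" size class: 32kB */

static void *pool[POOL_SLOTS];
static int   pool_count;

static void *
pool_alloc(size_t size)
{
    /* reuse a cached block if the request fits into one */
    if (size <= POOL_BLOCK_SIZE && pool_count > 0)
        return pool[--pool_count];

    return malloc(size < POOL_BLOCK_SIZE ? POOL_BLOCK_SIZE : size);
}

static void
pool_free(void *ptr, size_t size)
{
    /* keep a handful of blocks for later reuse, return the rest to libc */
    if (size <= POOL_BLOCK_SIZE && pool_count < POOL_SLOTS)
        pool[pool_count++] = ptr;
    else
        free(ptr);
}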
But there’s a much simpler way to improve this - glibc allows tuning the malloc behavior, e.g. how eagerly the memory is returned to the kernel. Because that’s likely the problem here - Postgres returns the memory to glibc (by calling free), and glibc hands it back to the kernel. Then we request a new chunk from glibc, and it has to get it from the kernel, incurring syscall overhead etc.
There’s a mallopt function to adjust this behavior, with a manpage describing the available tunables. The one important for this post is M_TOP_PAD (controlled by the MALLOC_TOP_PAD_ environment variable), which determines how much “slack” we keep in the heap. When requesting more memory from the kernel using sbrk, we get a bit more of it. Conversely, when freeing memory, we don’t hand it back to the kernel immediately, but keep a buffer for future requests.
The default M_TOP_PAD value is 128kB, but this is clearly not enough - with 1000 index scans, we need about 27MB just for that. Let’s try bumping the value to 64MB. The easiest way is to set the environment variable:
# set MALLOC_TOP_PAD to 64MB
export MALLOC_TOP_PAD_=$((64*1024*1024))
Then restart the Postgres instance (in the same shell). And if you run the benchmarks again, you should get something like 250 tps and 1350 tps. That’s about 2x the original throughput - not bad for setting a glibc environment variable.
Note: Notice the trailing underscore at the end of the variable. This is intentional, not a typo - see the mallopt manpage.
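The environment variable is the least invasive way to do this. The same knob can also be set programmatically via mallopt() early in the process lifetime - say, from a preloaded extension’s _PG_init() - which might look roughly like this (a sketch, not a recommendation to hardcode the value):

#include <malloc.h>     /* mallopt, M_TOP_PAD - glibc specific */

/* equivalent of MALLOC_TOP_PAD_=67108864, done from C */
static void
tune_glibc_malloc(void)
{
    /*
     * Keep up to 64MB of free memory at the top of the heap instead of
     * returning it to the kernel on every free(). mallopt() returns 1
     * on success and 0 on error (e.g. on a non-glibc libc).
     */
    mallopt(M_TOP_PAD, 64 * 1024 * 1024);
}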
Conclusion
Of course, even these improved values are still much lower than results from tests without partitioning. There’s clearly a cost for processing many more relations. If you have 1000 relations instead of 1, it would be weird if planning/execution wasn’t more expensive.
In this case the cost (throughput drops to 1/400) is extreme, partially due to how the example is constructed. The dataset is tiny (so no I/O), and the query has to touch all the partitions. It’s not entirely crazy, though. Machines with enough RAM to keep everything in-memory are common, and no partitioning scheme works for all queries.
It’s advisable to pick a conservative number of partitions - don’t use too many. For example if you want to archive data every month, go with monthly partitions, don’t go with daily partitions “just in case”.
Similarly, be careful when designing the partitioning scheme. It will determine which queries can benefit from partition elimination. No scheme is perfect for all queries - it’s a tradeoff. Pick a scheme that works well for the important / frequent queries.
We’ll continue improving scalability with many partitions. But that just means costs like allocation will have more impact in the future. And so tuning glibc (and similar) will be more important.