Tuning the glibc memory allocator (for Postgres)
If you’ve done any Postgres development in C, you’re probably aware of the concept of memory contexts. The primary purpose of memory contexts is to absolve the developers of having to track every single piece of memory they allocated. But it’s about performance too, because memory contexts cache the memory to save on malloc/free calls. But malloc gets the memory from another allocator in libc, and each libc does its own thing. The glibc allocator has some concurrency bottlenecks (which I learned the hard way), but it’s possible to tune that.
Memory Context caching
If you ever wrote anything in C, you know how tedious it is to deal with memory. Not only do you have to allocate the memory (using malloc), you also have to track all the allocated pieces and free them at just the right moment. If you free them too early, you get use-after-free errors. If you free them too late, you get memory leaks (or at least memory usage much higher than necessary).
This is why Postgres relies on something we call “memory context.” The primary purpose of a memory context is to group allocations. You can imagine the context is keeping a list of pointers returned by malloc() (it’s much more elaborate in practice). This means that when we’re done executing a part of a query, we can simply do MemoryContextDelete and that automatically frees all the memory associated with the context. Which is super convenient.
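To make that concrete, here’s a minimal sketch of how backend code typically uses this. The context name and the loop are made up for illustration; the functions (AllocSetContextCreate, MemoryContextSwitchTo, palloc, MemoryContextDelete) are the standard memory context API:

#include "postgres.h"
#include "utils/memutils.h"

static void
do_some_work(void)
{
    /* create a private context as a child of the current one */
    MemoryContext mycxt = AllocSetContextCreate(CurrentMemoryContext,
                                                "my work context",
                                                ALLOCSET_DEFAULT_SIZES);
    MemoryContext oldcxt = MemoryContextSwitchTo(mycxt);

    for (int i = 0; i < 1000; i++)
    {
        char *buf = palloc(128);    /* allocated in mycxt */

        /* ... do something with buf, no pfree needed ... */
        (void) buf;
    }

    MemoryContextSwitchTo(oldcxt);
    MemoryContextDelete(mycxt);     /* frees all 1000 allocations at once */
}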
Note: We often use the terms “context” and “allocator” as if they were synonyms, but it’s not quite correct I think. An allocator is responsible only for allocating and freeing pieces of memory, nothing more. Context is expected to do more than that - it logically “groups” the allocated memory chunks, allows freeing all the memory at once, etc. Of course, the allocator may need to do some of this internally too, but it’s not exposed to users.
I don’t want to get into the details of memory contexts - the hierarchy of contexts, the specialized implementations we have, etc. There’s a README with plenty of details, and there are also comments in each variant.
One feature I need to mention is caching of freed memory, because it’s very relevant for this post.
Note: For the rest of this post I’m going to assume the memory context is AllocSet. This is the oldest, most universal, and most common memory context implementation. Most of this equally applies to the other contexts.
Imagine you do this:
char *ptr;
ptr = palloc(128);
ptr = palloc(128);
The palloc/pfree are our malloc/free wrappers that work with the “current” memory context. The first palloc will call malloc to get some memory from the OS - we don’t malloc just the 128B, but a larger chunk of memory - could be 1kB, for example. Then we return a pointer to the first 128B, and remember the 128B are used.
Then the second palloc arrives, asking for another 128B. But this time we don’t need to call malloc at all. We still have ~900B available from the last malloc, so we just hand out a pointer, and update the accounting of how much free space we still have. You could repeat the palloc call about 6 more times until the next malloc is needed.
This massively reduces the number of malloc calls. It’s not uncommon to see malloc allocating 1MB+ blocks that can handle thousands of palloc allocations. This matters, because the malloc calls can be quite expensive, as we’ll see later.
Consider this slight variation of the earlier example:
char *ptr;
ptr = palloc(128);
pfree(ptr);
ptr = palloc(128);
This is even less likely to call malloc, because the pfree does not actually call free. Instead, it puts the pointer onto a freelist (along with information about how large it is), and uses it for the next palloc call. You could even put the pfree+palloc into a loop, and it would never call malloc.
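The real AllocSet code is much more elaborate (freelists per size class, chunk headers, block management), but a toy standalone version of the two tricks - carving chunks from a big malloc’d block, and recycling freed chunks through a freelist - might look like this. Illustration only, not Postgres code:

#include <stdlib.h>
#include <string.h>

/* Toy fixed-size allocator: one malloc'd block, plus a freelist of
 * recycled 128B chunks. Not the real AllocSet, just the idea. */
#define CHUNK_SIZE  128
#define BLOCK_SIZE  1024

static char  *block;            /* current block obtained from malloc() */
static size_t block_used;       /* bytes already handed out from it */
static void  *freelist;         /* chain of recycled chunks */

static void *
toy_alloc(void)
{
    /* 1. prefer a recycled chunk - no malloc() needed */
    if (freelist)
    {
        void *chunk = freelist;

        memcpy(&freelist, chunk, sizeof(void *));   /* pop from the list */
        return chunk;
    }

    /* 2. otherwise carve from the current block, getting a new one if needed */
    if (block == NULL || block_used + CHUNK_SIZE > BLOCK_SIZE)
    {
        block = malloc(BLOCK_SIZE);     /* the only malloc() call here */
        block_used = 0;
    }
    block_used += CHUNK_SIZE;
    return block + block_used - CHUNK_SIZE;
}

static void
toy_free(void *chunk)
{
    /* don't call free(), just push the chunk onto the freelist */
    memcpy(chunk, &freelist, sizeof(void *));
    freelist = chunk;
}

A toy_free() + toy_alloc() loop never reaches malloc(), which is exactly what the pfree+palloc loop above does inside a memory context.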
This caching is very important, because malloc can be surprisingly costly. As usual, it does depend on a lot of things. Firstly, malloc is not a syscall but a function provided by libc. And there are many libc implementations, each with their own malloc variant.
The Linux glibc seems to be fairly conservative, using a variant of ptmalloc, derived from the original dlmalloc allocator. FreeBSD and NetBSD switched to a completely new allocator called jemalloc, focused on SMP systems, concurrency, etc. Other operating systems do their own thing too, of course.
Each of those malloc implementations is meant to allocate memory, but the designs are very different, putting emphasis on different tradeoffs. Allocators designed for single-core systems will look very different from allocators designed for systems with many cores.
This blog post is about the glibc allocator, which is what nearly all Linux systems use - and thus likely what most people run Postgres on.
Limits of caching
But if we have caching built into our memory contexts, why should it matter that malloc can be expensive? Turns out there are cases when the caching can’t help.
The caching happens within a given memory context, so once that context gets destroyed, the cache disappears. This typically happens at the end of a query - we create a “per query” context, it handles all allocations for that query (many of them from the cache). Then the query completes, and we call MemoryContextDelete on all the contexts, which returns the memory back to libc. And on the next query we start from scratch, with empty caches.
We have little control over what the libc allocator does. It might have kept the memory in its own cache (so the malloc will be cheap). Or maybe it eagerly returned the memory to the OS (malloc will need a syscall, and may be expensive).
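If you’re curious which of the two happened, glibc can tell you - mallinfo2() (available since glibc 2.33) reports how much memory the allocator currently holds. A small standalone experiment; the sizes are arbitrary, and the exact numbers will vary by glibc version and settings:

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>     /* mallinfo2, glibc 2.33+ */

static void
report(const char *label)
{
    struct mallinfo2 mi = mallinfo2();

    printf("%-10s arena=%zu bytes, free in arena=%zu bytes\n",
           label, mi.arena, mi.fordblks);
}

int
main(void)
{
    char *chunks[1000];

    report("start");

    /* allocate ~27MB in 27kB pieces */
    for (int i = 0; i < 1000; i++)
        chunks[i] = malloc(27 * 1024);
    report("allocated");

    /* free everything - glibc may keep it, or give it back to the kernel */
    for (int i = 0; i < 1000; i++)
        free(chunks[i]);
    report("freed");

    return 0;
}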
Another case is about “oversized chunks”. I said that pfree does not actually do free on the chunk, but stashes it into a freelist. But we only do that for “small” chunks, less than 8kB. The basic reasoning is that we don’t want too many freelists, we expect smaller chunks to be reused more often, and we don’t want to risk keeping large chunks unnecessarily.
Which means that if you request a chunk larger than 8kB, it will always do a malloc.
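In the source this is the allocChunkLimit check near the top of AllocSetAlloc in src/backend/utils/mmgr/aset.c. Roughly (heavily simplified, not the literal code):

/* simplified sketch of the AllocSetAlloc() logic in aset.c */
void *
AllocSetAlloc(MemoryContext context, Size size, int flags)
{
    AllocSet    set = (AllocSet) context;

    /*
     * Oversized chunks bypass the freelists entirely: each one gets its
     * own malloc'd block, and pfree() returns it to libc immediately.
     */
    if (size > set->allocChunkLimit)
        return AllocSetAllocLarge(context, size, flags);

    /* ... normal path: check the freelist, else carve from the current block ... */
}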
When caching does not work
But how big is the difference when the caching doesn’t work? Let’s do a quick test, on a table with many partitions. Initialize a small pgbench database (scale 1 is just 15MB) with 1000 partitions:
$ pgbench -i -s 1 --partitions=1000 test
and modify the schema by adding an index on columns that are not in the partition key. We just create a copy of the aid column because it’s simple, but it could be an arbitrary column. The point is that conditions on this column do not allow partition elimination.
ALTER TABLE pgbench_accounts ADD COLUMN aid_copy INT;
UPDATE pgbench_accounts SET aid_copy = aid;
CREATE INDEX ON pgbench_accounts(aid_copy);
VACUUM FREEZE;
VACUUM ANALYZE;
Great, now let’s run this pgbench script:
\set aid random(1, 100000 * :scale)
SELECT * FROM pgbench_accounts pa
JOIN pgbench_branches pb ON (pa.bid = pb.bid)
WHERE pa.aid_copy = :aid;
On my machine, this does only about 100 tps on a single client, when for a non-partitioned table it can easily do 35000 tps. For 16 clients, the difference is 700 tps vs. 340k tps. Sure, some of this is due to having to deal with 1000x more objects during planning, it’s the partitioning tax. But a factor of ~400 is maybe a bit too much?
Anyway, let’s do a bit of profiling using perf. Run the test with a single client, and collect a profile like this:
$ perf record -g -a --call-graph dwarf -- sleep 10
$ perf report > report.txt
And then do the same thing for 16 clients - be warned that this can collect quite a bit of data; on my machine the perf.data is ~5GB.
Anyway, if you look into the profile for a single client, you should see something like this (you’ll need to scroll a bit):
16.40% 0.18% postgres postgres [.] AllocSetAllocLarge
|
--16.22%--AllocSetAllocLarge
|
--16.06%--__GI___libc_malloc (inlined)
|
--16.03%--_int_malloc
|
|--7.03%--asm_exc_page_fault
| |
| --6.74%--exc_page_fault
| |
| --6.60%--do_user_addr_fault
| |
| --...
|
|
|--5.60%--sysmalloc
| |
| |--2.64%--__glibc_morecore (inlined)
| | |
| | --2.63%--__GI___sbrk (inlined)
| | |
| | --2.63%--__brk
| | |
| | --1.98%--...
| |
| |--1.97%--asm_exc_page_fault
| | |
| | --1.91%--...
Well, that’s quite a bit of time spent on malloc! In the profile for 16 clients it should be similar - I get values that are a bit lower, most likely because this is on-CPU time (so excludes lock waits etc.).
Why is this happening?
If you look at the explain for the query, you’ll get this:
QUERY PLAN
-----------------------------------------------------------------------------------------
Nested Loop
Join Filter: (pb.bid = pa.bid)
-> Seq Scan on pgbench_branches pb
-> Append
-> Index Scan using pgbench_accounts_1_aid_copy_idx on pgbench_accounts_1 pa_1
Index Cond: (aid_copy = 11)
-> Index Scan using pgbench_accounts_2_aid_copy_idx on pgbench_accounts_2 pa_2
Index Cond: (aid_copy = 11)
-> Index Scan using pgbench_accounts_3_aid_copy_idx on pgbench_accounts_3 pa_3
Index Cond: (aid_copy = 11)
-> Index Scan using pgbench_accounts_4_aid_copy_idx on pgbench_accounts_4 pa_4
Index Cond: (aid_copy = 11)
...
(2004 rows)
There are 1000 index scans, because the condition is on a column that’s not in the partition key, and thus we can’t prune any partitions. We have to initialize each index scan, and btbeginscan does this:
IndexScanDesc
btbeginscan(Relation rel, int nkeys, int norderbys)
{
...
/* allocate private workspace */
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
...
Unfortunately, BTScanOpaqueData is pretty large - about 27kB, so it qualifies as an oversized chunk, and thus won’t be cached. Each query will initialize 1000 index scans, each doing a malloc+free, and clearly it can be quite expensive in glibc.
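Where do the ~27kB come from? BTScanOpaqueData embeds two BTScanPosData structs, and each of those ends with a fixed-size array big enough to remember every TID a btree page could possibly return. Simplified from src/include/access/nbtree.h (most fields omitted):

typedef struct BTScanPosItem    /* what we remember about each match */
{
    ItemPointerData heapTid;
    OffsetNumber indexOffset;
    LocationIndex tupleOffset;
} BTScanPosItem;

typedef struct BTScanPosData
{
    /* ... buffer, page LSN, moreLeft/moreRight flags, ... */
    BTScanPosItem items[MaxTIDsPerBTreePage];   /* well over a thousand entries for 8kB pages */
} BTScanPosData;

typedef struct BTScanOpaqueData
{
    /* ... scan keys, array keys, kill-list, ... */
    BTScanPosData currPos;      /* roughly 13kB */
    BTScanPosData markPos;      /* another 13kB */
} BTScanOpaqueData;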
Tuning glibc allocator
Now we know what the problem is, but what can we do about it? Obviously, the ideal option would be to not do the allocation. If you can skip something, that’s the best optimization of all. Sadly, it’s not clear if we can skip these allocations easily.
Maybe we could allocate just the “main” part of the BTScanOpaque struct, which is only about 100B (and thus could be cached)? And then allocate the “item” arrays opportunistically, if actually needed (which for this query would be very rare).
Note: I haven’t tried that, but it might be an interesting way to learn about the btree code, and perhaps even a nice first patch. The hard part will be to decide if this is a good idea overall, not just for one testing query.
Another option might be to allow caching for these chunks. For example, we might raise the limit from 8kB to 32kB. But that’s risky, we might just increase memory usage with no benefit. Perhaps it would be better to add a new caching layer between libc and our memory contexts, and track these chunks there. We’d still risk wasting the memory, but this cache would be shared by all memory contexts, possibly even between query executions. I actually wrote an experimental version of this, see the memory pool patch in this thread for details.
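I won’t reproduce the patch here, but the general shape of the idea is simple: a small process-wide cache of large blocks sitting between the memory contexts and libc. A toy sketch of the concept (not the actual patch - that one has size classes, limits on how much memory may be cached, accounting, etc.):

#include <stdlib.h>

/* Toy process-wide cache of large blocks between contexts and libc.
 * Concept sketch only. */
#define POOL_SLOTS      8
#define POOL_BLOCK_SIZE (32 * 1024)     /* cache one "large" size class: 32kB */

static void *pool[POOL_SLOTS];
static int   pool_count;

static void *
pool_alloc(size_t size)
{
    /* reuse a cached block if the request fits into one */
    if (size <= POOL_BLOCK_SIZE && pool_count > 0)
        return pool[--pool_count];

    return malloc(size < POOL_BLOCK_SIZE ? POOL_BLOCK_SIZE : size);
}

static void
pool_free(void *ptr, size_t size)
{
    /* keep a handful of blocks for later reuse, return the rest to libc */
    if (size <= POOL_BLOCK_SIZE && pool_count < POOL_SLOTS)
        pool[pool_count++] = ptr;
    else
        free(ptr);
}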
But there’s a much simpler way to improve this - glibc allows tuning the malloc behavior, e.g. how eagerly the memory is returned to the kernel. Because that’s likely the problem here - Postgres returns the memory to glibc (by calling free), and glibc hands it back to the kernel. Then we request a new chunk from glibc, and it has to get it from the kernel, incurring syscall overhead etc.
There’s a mallopt function to adjust this behavior, with a manpage describing the available tunables. The one important for this post is M_TOP_PAD (controlled by the MALLOC_TOP_PAD_ environment variable), which determines how much “slack” we keep in the heap. When requesting more memory from the kernel using sbrk, we get a bit more of it. Conversely, when freeing memory, we don’t hand it back to the kernel immediately, but keep a buffer for future requests.
The default M_TOP_PAD value is 128kB, but this is clearly not enough - with 1000 index scans, we need about 27MB just for that. Let’s try bumping the value to 64MB. The easiest way is to set the environment variable:
# set MALLOC_TOP_PAD to 64MB
export MALLOC_TOP_PAD_=$((64*1024*1024))
Then restart the Postgres instance (in the same shell). And if you run the benchmarks again, you should get something like 250 tps and 1350 tps. That’s about 2x the original throughput - not bad for setting a glibc environment variable.
Note: Notice the trailing underscore at the end of the variable. This is intentional, not a typo - see the mallopt manpage.
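The environment variable is the least invasive way to do this. The same knob can also be set programmatically via mallopt() early in the process lifetime - say, from a preloaded extension’s _PG_init() - which might look roughly like this (a sketch, not a recommendation to hardcode the value):

#include <malloc.h>     /* mallopt, M_TOP_PAD - glibc specific */

/* equivalent of MALLOC_TOP_PAD_=67108864, done from C */
static void
tune_glibc_malloc(void)
{
    /*
     * Keep up to 64MB of free memory at the top of the heap instead of
     * returning it to the kernel on every free(). mallopt() returns 1
     * on success and 0 on error (e.g. on a non-glibc libc).
     */
    mallopt(M_TOP_PAD, 64 * 1024 * 1024);
}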
Conclusion
Of course, even these improved values are still much lower than results from tests without partitioning. There’s clearly a cost for processing many more relations. If you have 1000 relations instead of 1, it would be weird if planning/execution wasn’t more expensive.
In this case the cost (throughput drops to 1/400) is extreme, partially due to how the example is constructed. The dataset is tiny (so no I/O), and the query has to touch all the partitions. It’s not entirely crazy, though. Machines with enough RAM to keep everything in-memory are common, and no partitioning scheme works for all queries.
It’s advisable to pick a conservative number of partitions - don’t use too many. For example if you want to archive data every month, go with monthly partitions, don’t go with daily partitions “just in case”.
Similarly, be careful when designing the partitioning scheme. It will determine which queries can benefit from partition elimination. No scheme is perfect for all queries - it’s a tradeoff. Pick a scheme that works well for the important / frequent queries.
We’ll continue improving scalability with many partitions. But that just means costs like allocation will have more impact in the future. And so tuning glibc (and similar) will be more important.