Tomas Vondra

blog about Postgres code and community

[PATCH IDEA] parallel pgbench -i

There are multiple tools for benchmarking Postgres, but pgbench is probably the most widely used one. The workload is very simple and perhaps a bit synthetic, but almost everyone is familiar with it and it’s a very convenient way to do quick tests and assessments. It has been improved in various ways over the years (e.g. to support partitioning), but the initial data load is still serial - only a single process does the COPY. Which annoys me, because it may take a lot of time before I can start with the benchmark itself.

This week’s “first patch” idea is to extend pgbench -i to allow the data load to happen in parallel, with multiple clients generating and sending the data.

If you didn’t read the first post about how to pick the first patch, maybe do so now. Everything I wrote in that post still applies - you should pick a patch that’s interesting (and useful) for you personally. Don’t jump on this idea simply because I posted about it.

If you decide to give this patch idea a try, let me know. Maybe not right away, but once you get past the initial experiments. Otherwise multiple people might end up working on it, and everyone except the first person to submit to pgsql-hackers would have to throw their work away. Even then you’d gain insight and the ability to review the submitted patch, so it wouldn’t be a total waste of time. But if your goal was to write the first patch …

Motivation

I think I mostly already explained the motivation - speed up pgbench -i so that the actual benchmark setup takes less time.

But on second thought, the setup can be considered a benchmark on its own. It’s useful to know how fast you can push large amounts of data into the database - analytical systems do that all the time. So why not have a convenient way to do that?

Implementation

This patch idea is a bit different from the earlier ones, because it’s entirely on the client side. pgbench connects to the database over a socket, but the code is much simpler - it’s a fairly traditional C application. It does not use any of the infrastructure specific to the server - memory contexts, shared memory, and so on. For anyone with basic C knowledge, it’s much easier to understand and modify.

pgbench already knows how to run multiple jobs in parallel - if you specify -j N, it creates N “jobs” to generate the workload and manage client connections. This is done using pthreads, and the parallel data load could use the same infrastructure.

I imagine it would be possible to run pgbench -i -s 10000 -j 32, and this would load the data (in this case ~150GB) using 32 clients, each doing a COPY of a subset of the data.
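To make that a bit more concrete, here’s a rough sketch of what such a loader worker could look like, using libpq’s COPY protocol and one pthread per worker. This is not actual pgbench code - the names (load_worker, LoadSlice, NACCOUNTS_PER_SCALE) are made up for illustration, and pgbench itself would of course reuse its existing thread and connection handling:

    /* rough sketch of a parallel loader worker - hypothetical names, not pgbench code */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <pthread.h>
    #include <libpq-fe.h>

    #define NACCOUNTS_PER_SCALE 100000      /* accounts per unit of scale, as in pgbench */

    typedef struct LoadSlice
    {
        const char *conninfo;               /* connection string */
        long        start_id;               /* first aid to generate (inclusive) */
        long        end_id;                 /* last aid to generate (inclusive) */
    } LoadSlice;

    /* each worker opens its own connection and streams its slice via COPY */
    static void *
    load_worker(void *arg)
    {
        LoadSlice  *slice = (LoadSlice *) arg;
        PGconn     *conn = PQconnectdb(slice->conninfo);
        PGresult   *res;
        char        row[256];

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            exit(1);
        }

        res = PQexec(conn,
                     "COPY pgbench_accounts (aid, bid, abalance, filler) FROM STDIN");
        if (PQresultStatus(res) != PGRES_COPY_IN)
        {
            fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
            exit(1);
        }
        PQclear(res);

        /* generate one text-format row per account ID in the slice */
        for (long aid = slice->start_id; aid <= slice->end_id; aid++)
        {
            long    bid = (aid - 1) / NACCOUNTS_PER_SCALE + 1;

            snprintf(row, sizeof(row), "%ld\t%ld\t0\t\n", aid, bid);
            if (PQputCopyData(conn, row, strlen(row)) != 1)
            {
                fprintf(stderr, "PQputCopyData failed: %s", PQerrorMessage(conn));
                exit(1);
            }
        }

        if (PQputCopyEnd(conn, NULL) != 1)
        {
            fprintf(stderr, "PQputCopyEnd failed: %s", PQerrorMessage(conn));
            exit(1);
        }

        /* consume the result of the COPY command */
        while ((res = PQgetResult(conn)) != NULL)
            PQclear(res);

        PQfinish(conn);
        return NULL;
    }

A standalone version of this can be compiled with something like gcc loader.c -lpq -lpthread, which also makes it easy to experiment with the approach before touching pgbench at all.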

How exactly would the data be split between workers? There are different ways to do that, and I’m not sure which one is the best. For example:

  • We know the range of IDs we need to generate, so we can simply split it into N contiguous ranges, and each worker loads one of them (a rough sketch of this appears below, after the list). This breaks the locality/correlation that a single-process load produces.

  • We can have a coordinator that assigns the workers smaller ranges of IDs to generate data for. This maintains the locality much better.

  • If the setup uses partitioning, we could make each worker responsible for loading a subset of partitions. This would work nicely with range partitions.

There are probably a couple other ways to do this. Figuring out which strategy to use is likely one of the important tasks.
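For illustration, here’s how the first option (statically splitting the ID range) might look, continuing the sketch from above - compute N contiguous chunks of the aid range and hand one to each worker thread. Again, the names and the hardcoded scale/thread counts are invented for the example, not a proposal for the actual pgbench code:

    /* rough sketch: split [1, scale * 100000] into nthreads contiguous chunks */
    int
    main(void)
    {
        const char *conninfo = "dbname=postgres";
        int         scale = 10000;
        int         nthreads = 32;
        long        total = (long) scale * NACCOUNTS_PER_SCALE;
        long        chunk = (total + nthreads - 1) / nthreads;  /* round up */
        pthread_t  *threads = malloc(nthreads * sizeof(pthread_t));
        LoadSlice  *slices = malloc(nthreads * sizeof(LoadSlice));

        for (int i = 0; i < nthreads; i++)
        {
            slices[i].conninfo = conninfo;
            slices[i].start_id = (long) i * chunk + 1;
            slices[i].end_id = (long) (i + 1) * chunk;
            if (slices[i].end_id > total)
                slices[i].end_id = total;

            pthread_create(&threads[i], NULL, load_worker, &slices[i]);
        }

        /* wait for all loaders to finish */
        for (int i = 0; i < nthreads; i++)
            pthread_join(threads[i], NULL);

        free(threads);
        free(slices);
        return 0;
    }

The coordinator variant from the second bullet would replace this static start_id/end_id assignment with a shared counter (protected by a mutex or an atomic) from which workers grab much smaller ranges, which keeps the loaded data closer to the ordering a serial load would produce.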

Risks

I think the main risk is that the speedup may be worse than expected, and the improvement won’t be considered worth the complexity. This can be mitigated by doing some quick tests first, to see how much faster a parallel load would be.

Conclusions

That’s it. As always - if you’re interested, feel free to reach out to me directly by email. Or you can talk to other Postgres developers on pgsql-hackers, or in the new Discord channel.

Do you have feedback on this post? Please reach out by e-mail to tomas@vondra.me.