Tomas Vondra

[PATCH IDEA] Using COPY for postgres_fdw INSERT batching

In an earlier post I mentioned that I plan to share a couple of patch ideas suitable for new contributors. This is the first one: using the COPY protocol for postgres_fdw INSERT batching. It would replace the current implementation, which is based on prepared statements. Let me share a couple of thoughts on the motivation and how it might be implemented.

Everything I wrote in the first post still applies. In particular, you should be looking for a patch that’s interesting (and useful) for you personally. Don’t jump on this idea simply because it’s the first patch idea I posted.

If you decide to give this patch idea a try, let me know. Maybe not right away, but once you get past the initial experiments. Otherwise multiple people might end up working on it in parallel, and everyone except the first person to submit it to pgsql-hackers would have to throw their work away. Even then the effort is not a total waste, since it gives you insight and the ability to review the submitted patch. But if your goal was to write the first patch …

Since PostgreSQL 14, the postgres_fdw module supports bulk inserts. That means inserts are done in batches, and the main benefit is of course better performance, especially on connections with higher latency.

The way this was implemented is a bit weird, though. rebuildInsertSql constructs an INSERT query for a particular number of rows, passing the values as parameters in the VALUES clause:

INSERT INTO my_table VALUES ($1, $2), ($3, $4), ..., ($10001, $10002);

The statement is then prepared and executed repeatedly. This works, and postgres_fdw already relies on prepared statements elsewhere, so it’s not an unreasonable approach.
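For illustration, here is a simplified libpq-level sketch of that flow. The actual code in execute_foreign_modify is more involved; the function name, table and parameter values here are just placeholders:

#include <libpq-fe.h>

static void
insert_batch_prepared(PGconn *conn, const char *const *param_values)
{
    PGresult   *res;

    /* prepare a multi-row INSERT sized for the batch (2 rows here) */
    res = PQprepare(conn, "batch_insert",
                    "INSERT INTO my_table VALUES ($1, $2), ($3, $4)",
                    4, NULL);
    PQclear(res);

    /* execute it repeatedly, passing all values for a batch
     * as one flat array of text-format parameters */
    res = PQexecPrepared(conn, "batch_insert", 4, param_values,
                         NULL, NULL, 0);
    PQclear(res);
}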

Motivation

But there are some drawbacks too. First, the batch size depends on how wide the table is - the protocol allows at most 65,535 parameters per statement (parameter indexes are uint16), so a batch can have at most 65535 / (number of columns) rows. For a wide table the batches may be much smaller than expected; with 100 columns, for example, a batch is limited to 655 rows.

We actually have a command designed for bulk loads - COPY. It has no 64k limit on the number of parameters, and it does not need to do any planning, so it should be more efficient too.
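The remote statement for a batch could then be as simple as this (the table and column names are just illustrative):

COPY my_table (col1, col2) FROM STDIN

The same statement works for any number of rows, so nothing needs to be rebuilt when the batch size changes.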

So the patch idea is to move the bulk inserts in postgres_fdw from prepared statements to COPY.

Implementation

I believe the implementation should be fairly straightforward. Pretty much everything important happens in execute_foreign_modify.

The current code first rebuilds the INSERT statement (if needed) to match the batch size, and then fills in the parameters. For COPY this gets simpler, as the statement does not depend on the batch size.

In fact, COPY does not even support prepared statements, and the values will need to be passed in a different way too (probably using the text format). Perhaps this should be handled as a special case right at the beginning of execute_foreign_modify, which would then exit early?
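To make this concrete, here is a minimal sketch of what the libpq-level COPY flow might look like, independent of the postgres_fdw internals. The function name and table are hypothetical, and a real implementation would also have to escape tabs, newlines and backslashes in the values:

#include <string.h>
#include <libpq-fe.h>

static void
send_batch_via_copy(PGconn *conn, char **rows, int nrows)
{
    PGresult   *res;
    int         i;

    /* switch the connection into COPY IN mode */
    res = PQexec(conn, "COPY my_table (col1, col2) FROM STDIN");
    if (PQresultStatus(res) != PGRES_COPY_IN)
        { /* handle the error */ }
    PQclear(res);

    /* send each row as tab-separated text terminated by a newline,
     * with NULL values spelled as \N */
    for (i = 0; i < nrows; i++)
        PQputCopyData(conn, rows[i], strlen(rows[i]));

    /* finish the COPY and check the command result */
    PQputCopyEnd(conn, NULL);
    res = PQgetResult(conn);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        { /* handle the error */ }
    PQclear(res);
}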

Risks

I can’t think of any major risks. The patch should only need to touch a very isolated part of the code, in a single extension - it pretty much just generates a different query for bulk inserts, that’s it. It will require testing, both for correctness and for performance, to demonstrate the expected benefit. But every patch needs that.

Conclusions

So, that was the first patch idea. If needed, you can reach out to me directly by email. There are also a number of places where you can talk to a bigger group of Postgres developers, for example the pgsql-hackers mailing list or the new Discord channel. There are many similar channels on various other social platforms.

Do you have feedback on this post? Please reach out by e-mail to tomas@vondra.me.