Tomas Vondra

blog about Postgres code and community

[PATCH IDEA] adaptive execution for `IN` queries

Last week I visited the Malmö PUG to talk about performance cliffs. It’s a really nice meetup - cozy environment, curious audience asking insightful questions. I highly recommend attending or even giving a talk there.

After the meetup I realized it’s been a while since I posted about a patch idea, and the performance cliff talk has a couple of good candidates. Some might be a bit too hard for a first patch, for example improving the JIT costing. But improving the very first example, about queries with IN clauses, seems feasible. It’s quite well defined and isolated.

So, what’s the problem with IN queries? This might be a good time to watch the talk - I’m aware of two recordings:

The SF PUG recording is a regular 45-minute talk; the POSETTE one is shorter (about 25 minutes), but should be enough to explain the problem. There’s also a recent postgres.fm podcast episode discussing performance cliffs, but that’s a more general discussion built on top of the talk.

Motivation

The gist of the problem is fairly simple - consider a query with an IN clause:

SELECT * FROM test_table WHERE x IN (32, 243, 11, 123, ..., 999, 753)

If you don’t have an index on the x column, this will do a sequential scan and has to perform a membership check for each row. There are multiple ways to do that - we can do a linear search, for example. For short lists (say, a couple dozen elements) that’ll be just fine. If the lists get too long, we could sort the list and use a binary search, or maybe build a hash table. Either of these will be much faster than the linear search through a long list.
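To make the trade-off concrete, here’s a tiny standalone C sketch of the two extremes - a linear probe versus a sorted array with binary search. This is not Postgres code (Postgres builds an actual hash table, using the simplehash.h infrastructure, if I recall the commit correctly), just an illustration of why the per-check cost differs:

/* Standalone illustration of two membership-check strategies. */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

static int
cmp_int(const void *a, const void *b)
{
    int     x = *(const int *) a;
    int     y = *(const int *) b;

    return (x > y) - (x < y);
}

/* O(n) per probe - perfectly fine for a handful of elements */
static bool
in_list_linear(int value, const int *list, int n)
{
    for (int i = 0; i < n; i++)
        if (list[i] == value)
            return true;
    return false;
}

/* O(log n) per probe, after a one-time O(n log n) sort */
static bool
in_list_sorted(int value, const int *sorted, int n)
{
    return bsearch(&value, sorted, n, sizeof(int), cmp_int) != NULL;
}

int
main(void)
{
    int     list[] = {32, 243, 11, 123, 999, 753};
    int     n = sizeof(list) / sizeof(list[0]);

    printf("linear: 123 -> %d\n", in_list_linear(123, list, n));

    qsort(list, n, sizeof(int), cmp_int);
    printf("binary: 500 -> %d\n", in_list_sorted(500, list, n));
    return 0;
}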

Linear search was the only implemented strategy up until Postgres 14, when commit 50e17ad281b8 added support for building a hash table on long lists. Which means we can use the naive linear search for short lists, and a hash table for longer lists. Great!

But here lies the problem - where exactly is the threshold between long and short lists? Is it 8 elements, 100 elements? The commit simply hard-codes the threshold - we’ll use linear search for lists with up to 8 elements, and a hash table for lists with 9 elements or more.
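For reference, the planner-side rule really is just a constant comparison. Paraphrased (the MIN_ARRAY_SIZE_FOR_HASHED_SAOP constant does live in prepqual.c, but the code around it is simplified here):

/* Paraphrase of the check in convert_saop_to_hashed_saop(), not a quote. */
#include <stdbool.h>

#define MIN_ARRAY_SIZE_FOR_HASHED_SAOP 9

static bool
use_hashed_saop(int nelements)
{
    /* 8 elements -> linear search, 9 or more -> build a hash table */
    return nelements >= MIN_ARRAY_SIZE_FOR_HASHED_SAOP;
}

Note there’s no cost involved at all - only the element count.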

That works fine for most cases, but it’s not hard to construct cases where we should have switched at a different point. In the talk I show an example where making the list shorter doubles the query duration. That’s surprising - the intuition is that we’re reducing the amount of work the database may need to do, so it’s natural to expect the query to get a bit faster.

The root cause is that removing an element from the list results in switching from a hash table lookup to a linear search. And the example is constructed so that comparisons are expensive (because the strings have long common prefixes). That may seem unrealistic, but there are other reasons why data type comparisons can be expensive, so maybe it’s not.

I’m pretty sure I could construct an example in the opposite direction, switching to a hash table too early - that would require the comparison to be very cheap (compared to calculating the hash value etc.).

This happens because we pick the strategy based on the number of elements, and nothing more. Calculating a “better” threshold would need to consider the cost of comparison and hash calculation, but we don’t have this information. We do have some very rough cost information for each function (see pg_proc.procost), including the equality and hash functions for the data type. But those are too rough to be useful for this purpose.

It also can’t reflect how the cost is affected by the data - for example, without the long prefixes the comparisons would be much cheaper. But that’s not something the procost constant can express.

So the idea is to collect enough timing information about each strategy at runtime, and use that to adjust the threshold.

Implementation

The code backing IN () clauses is nicely isolated. I suggest looking at the commit 50e17ad281b8 that introduced the hash table lookup, but to sum it up:

  • The decision which strategy to use is currently made during query planning, in convert_saop_to_hashed_saop (in prepqual.c).

  • The evaluation of IN clauses happens in execExpr.c and execExprInterp.c (search for EEOP_SCALARARRAYOP and EEOP_HASHED_SCALARARRAYOP opcodes).

The first step would be to allow changing the decision during execution. Right now this happens during planning, and we set hashfuncid if we decide to use a hash table. But that does not work if we want to change the decision at execution time - we may want to start with a linear search, and then flip to a hash table later.

So we can’t rely on hashfuncid to track the current strategy. It would only determine whether we can use a hash table at all (some data types may not implement hashing). And then we’d have a separate flag to track the current strategy.

The expression code (execExpr.c / execExprInterp.c) would need to be updated to use the new flag, but that’s relatively straightforward.
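As a strawman, the execution-time state might look something like this. It’s a standalone sketch with made-up field names - the real patch would extend the existing scalar-array-op step state rather than introduce a new struct:

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-step state for an adaptive IN () check; none of these
 * field names exist in Postgres today.
 */
typedef struct AdaptiveSaopState
{
    bool        can_hash;       /* does the type have a hash function at all? */
    bool        use_hash;       /* current strategy, may flip at runtime */
    bool        hash_built;     /* hash table already constructed? */

    /* cheap runtime statistics used to (re)evaluate the decision */
    uint64_t    checks_done;    /* membership checks performed so far */
    uint64_t    linear_cycles;  /* sampled cost of linear probes */
    uint64_t    hash_cycles;    /* sampled cost of hash probes */
} AdaptiveSaopState;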

The expression interpreter may seem a bit confusing at first, due to a layer of opcodes and JIT. But what matters are the two functions implementing the strategies:

  • EEOP_SCALARARRAYOP -> ExecEvalScalarArrayOp

  • EEOP_HASHED_SCALARARRAYOP -> ExecEvalHashedScalarArrayOp

I imagine these functions would do the membership check just like now, and also collect enough information to calculate more accurate costs for each strategy. And then maybe flip to the other strategy, if appropriate.

I believe these changes should be relatively straightforward. The hard part is collecting sufficiently accurate cost data in a cheap way. This may require thinking outside the box, creativity and experimentation.

For example, it might seem natural to just call gettimeofday before and after each check and calculate how long it took. But gettimeofday can be damn expensive, and also not sufficiently accurate (microseconds at best, and even that’s not guaranteed). Both the overhead and the accuracy are very hardware and OS dependent, so it’s hard to say in general. But this overhead is why EXPLAIN ANALYZE often takes much longer than just executing the query.

But there may be alternative ways to measure the cost - I suspect we might use RDTSC, or maybe HPET with sampling, or something else. We don’t need to know the time in “human time” units, we just need to compare the costs of the strategies.
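One way to keep the overhead down, whatever counter we end up reading, is to measure only a small fraction of the checks and let the rest run unmeasured. A hypothetical sketch - read_cycle_counter(), probe_linear() and probe_hash() are stand-ins, not existing Postgres functions:

#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for a cheap counter and the two membership-check strategies. */
extern uint64_t read_cycle_counter(void);
extern bool probe_linear(void *state, const void *value);
extern bool probe_hash(void *state, const void *value);

typedef struct ProbeStats
{
    uint64_t    calls;          /* all membership checks */
    uint64_t    sampled_calls;  /* checks we actually timed */
    uint64_t    sampled_cycles; /* total cost of the timed checks */
} ProbeStats;

/* Measure only every 128th check, so the timer overhead is amortized. */
#define SAMPLE_EVERY 128

static bool
instrumented_probe(void *state, const void *value, bool use_hash,
                   ProbeStats *stats)
{
    bool        result;
    uint64_t    start;

    if (stats->calls++ % SAMPLE_EVERY != 0)
        return use_hash ? probe_hash(state, value)
                        : probe_linear(state, value);

    /* sampled path: time this one probe */
    start = read_cycle_counter();
    result = use_hash ? probe_hash(state, value)
                      : probe_linear(state, value);
    stats->sampled_cycles += read_cycle_counter() - start;
    stats->sampled_calls++;

    return result;
}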

If I were hacking on this, I’d try to implement it roughly like this:

  • Stick to the decision made by convert_saop_to_hashed_saop during query planning for a certain number of rows (say 1000?). Measure the cost of the checks.

  • Then switch to the other strategy for a bit, to get a cost for that too. It might be just a couple calls, enough to get a reliable estimate. And then adjust the strategy based on that.

  • After adjusting the strategy, verify the decision from time to time. The cost may depend on the data - the linear search may be cheap at the beginning of the table and expensive at the end, or something like that. So rechecking from time to time seems like a good idea. A rough sketch of this control loop is below.
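The control loop could be a small state machine along these lines. Everything here is made up (constants, names), and it ignores details like types that can’t be hashed at all - it’s only meant to show the measure / compare / settle / recheck cycle:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tuning knobs; the real values would need experimentation. */
#define TRIAL_ROWS       1000       /* trust the planner's choice this long */
#define PROBE_ROWS       64         /* how long to try the other strategy */
#define RECHECK_INTERVAL 100000     /* revisit the decision periodically */

typedef enum AdaptivePhase
{
    PHASE_CURRENT,      /* measuring the strategy we're currently using */
    PHASE_ALTERNATIVE,  /* briefly measuring the other strategy */
    PHASE_SETTLED       /* decision made, recheck once in a while */
} AdaptivePhase;

typedef struct AdaptiveDecision
{
    AdaptivePhase phase;
    bool        use_hash;           /* strategy used for the next check */
    uint64_t    checks_in_phase;
    double      current_cost;       /* avg sampled cost per check */
    double      alternative_cost;
} AdaptiveDecision;

/* Called after each membership check, with the running average cost. */
static void
adaptive_advance(AdaptiveDecision *d, double avg_cost)
{
    d->checks_in_phase++;

    switch (d->phase)
    {
        case PHASE_CURRENT:
            if (d->checks_in_phase >= TRIAL_ROWS)
            {
                d->current_cost = avg_cost;
                d->use_hash = !d->use_hash;     /* try the other strategy */
                d->phase = PHASE_ALTERNATIVE;
                d->checks_in_phase = 0;
            }
            break;

        case PHASE_ALTERNATIVE:
            if (d->checks_in_phase >= PROBE_ROWS)
            {
                d->alternative_cost = avg_cost;
                /* keep whichever strategy measured as cheaper */
                if (d->alternative_cost > d->current_cost)
                    d->use_hash = !d->use_hash;
                d->phase = PHASE_SETTLED;
                d->checks_in_phase = 0;
            }
            break;

        case PHASE_SETTLED:
            if (d->checks_in_phase >= RECHECK_INTERVAL)
            {
                /* data may have shifted - run the whole cycle again */
                d->phase = PHASE_CURRENT;
                d->checks_in_phase = 0;
            }
            break;
    }
}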

Risks

I suspect the main risk is (not) finding a good way to measure the cost of the different strategies. Those checks may be pretty short (very few instructions), so measuring needs to be very cheap but also accurate. I’m not sure it’s possible to have both.

On the other hand, if the checks get too fast to measure accurately, perhaps that means adjusting the strategy won’t make a difference anyway.

Another risk is platform support. I suggested using RDTSC earlier, but that’s an x86 instruction, and x86 is not the only platform we support. It would be good to find a good solution e.g. for arm64 (a quick search suggests CNTVCT_EL0 might be useful), or at least an acceptable generic fallback.
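For illustration, a cheap per-platform counter might look roughly like this (GCC/Clang-style inline assembly; cheap_clock() is a made-up name, and the units differ between platforms - which is fine, since we only ever compare the two strategies against each other on the same machine):

#include <stdint.h>
#include <time.h>

static inline uint64_t
cheap_clock(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t    lo, hi;

    /* RDTSC: time-stamp counter on x86 */
    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) hi << 32) | lo;
#elif defined(__aarch64__)
    uint64_t    val;

    /* CNTVCT_EL0: the virtual counter register on arm64 */
    __asm__ __volatile__("mrs %0, cntvct_el0" : "=r" (val));
    return val;
#else
    /* generic fallback: monotonic clock in nanoseconds */
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000 + (uint64_t) ts.tv_nsec;
#endif
}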

Conclusions

So that’s it. It’s not a huge patch, but it might be an interesting robustness improvement. As always, feel free to reach out to me directly by email. Or you can talk to other Postgres developers on pgsql-hackers, or in the new Discord channel.

Do you have feedback on this post? Please reach out by e-mail to tomas@vondra.me.