[PATCH IDEA] adaptive execution for `IN` queries
Last week I visited the Malmö PUG to talk about performance cliffs. It’s a really nice meetup - cozy environment, curious audience asking insightful questions. I highly recommend attending or even giving a talk there.
After the meetup I realized it’s been a while since I posted about some patch idea, and the performance cliff talk has a couple good candidates. Some might be a bit too hard for the first patch, for example improving the JIT costing. But improving the very first example, about queries with `IN` clauses, seems feasible. It’s quite well defined and isolated.
So, what’s the problem with `IN` queries? This might be a good time to watch the talk - I’m aware of two recordings:

The SF PUG recording is a regular 45-minute talk, the POSETTE one is shorter (about 25 minutes) but should be enough to explain the problem. There’s also a recent postgres.fm podcast discussing performance cliffs, but that’s a more general discussion, on top of the talk.
Motivation
The gist of the problem is fairly simple - consider a query with an `IN` clause:
SELECT * FROM test_table WHERE x IN (32, 243, 11, 123, ..., 999, 753)
If you don’t have an index on the `x` column, this will do a sequential scan and has to perform a membership check for each row. There are multiple ways to do that - we can do a linear search, for example. For short lists (say, a couple dozen elements) that’ll be just fine. If the lists get too long, we could sort the list and use a binary search, or maybe build a hash table. Either of these will be much faster than the linear search through a long list.
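To make the trade-off a bit more concrete, here is a tiny standalone sketch of two of those strategies, for a plain array of integer keys - just an illustration, not the actual PostgreSQL code (which deals with Datums and type-specific comparison functions):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Linear search: O(n) comparisons per row, fine for a handful of elements. */
static bool
in_list_linear(int x, const int *elems, int n)
{
    for (int i = 0; i < n; i++)
        if (elems[i] == x)
            return true;
    return false;
}

static int
cmp_int(const void *a, const void *b)
{
    int     ia = *(const int *) a;
    int     ib = *(const int *) b;

    return (ia > ib) - (ia < ib);
}

/* Binary search: O(log n) comparisons per row, but the list has to be
 * sorted once up front, e.g. qsort(elems, n, sizeof(int), cmp_int). */
static bool
in_list_sorted(int x, const int *elems, int n)
{
    return bsearch(&x, elems, n, sizeof(int), cmp_int) != NULL;
}
```

The hash-table variant has the same shape - pay a one-time setup cost so that each per-row check gets cheaper.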
Linear search was the only implemented strategy up until Postgres 14. Then commit 50e17ad281b8 added support for building a hash table on long lists. Which means we can use the naive linear search for short lists, and a hash table for longer lists. Great!
But here lies the problem - where exactly is the threshold between long and short lists? Is it 8 elements, 100 elements? The commit simply hard-codes the threshold - we’ll use linear search for lists with up to 8 elements, and a hash table for lists with 9 elements or more.
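In other words, the current decision boils down to something like this (a simplified sketch - the constant name here is made up, the threshold value matches the hard-coded one from the commit):

```c
#include <stdbool.h>

#define MIN_LIST_SIZE_FOR_HASHING 9     /* hypothetical name for the hard-coded threshold */

/* Planning-time decision: pick the strategy purely from the list length. */
static bool
should_use_hash_table(int num_elements)
{
    return num_elements >= MIN_LIST_SIZE_FOR_HASHING;
}
```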
That works fine for most cases, but it’s not hard to construct cases where we should have switched at a different point. In the talk I show an example where making the list shorter doubles the query duration. That’s surprising - the intuition is that we’re reducing the amount of work the database may need to do, so it’s natural to expect the query to get a bit faster.
The root cause is that removing an element from the list results in switching from a hash table lookup to a linear search. And the example is constructed so that comparisons are expensive (because the strings have long prefixes). That may seem unrealistic, but there are other reasons why data type comparisons may be expensive, so maybe not.
I’m pretty sure I could construct an example in the opposite direction, switching to a hash table too early - that would require the comparison to be very cheap (compared to calculating the hash value etc.).
This happens because we pick the strategy based on the number of elements, and nothing more. Calculating a “better” threshold would need to consider the cost of comparison and hash calculation, but we don’t have this information. We do have some very rough cost information for each function (see `pg_proc.procost`), including the equality and hash functions for the data type. But those are too rough to be useful for this purpose.
It also can’t reflect how the cost is affected by the data - for example without the long prefix, the comparisons would be much cheaper. But that’s not something the `procost` constant can express.
So the idea is to collect enough timing information about each strategy at runtime, and use that to adjust the threshold.
Implementation
The code backing `IN ()` clauses is nicely isolated. I suggest looking at the commit 50e17ad281b8 that introduced the hash table lookup, but to sum it up:

- The decision which strategy to use is currently made during query planning, in `convert_saop_to_hashed_saop` (in `prepqual.c`).
- The evaluation of `IN` clauses happens in `execExpr.c` and `execExprInterp.c` (search for the `EEOP_SCALARARRAYOP` and `EEOP_HASHED_SCALARARRAYOP` opcodes).
The first step would be to allow changing the decision during execution. Right now this happens during planning, and we set `hashfuncid` if we decide to use a hash table. But that does not work if we want to change the decision at execution time - we may want to start with a linear search, and then flip to a hash table later.

So we can’t rely on `hashfuncid` to track the current strategy. It would only determine whether we can use a hash table at all (some data types may not implement hashing). And then we’d have a separate flag to track the current strategy.
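To illustrate, the execution-time state might track something like this - all the names are made up for this sketch, none of this is an existing PostgreSQL struct:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical execution-time state for adaptive IN () evaluation. */
typedef struct AdaptiveSaopState
{
    bool        hash_capable;    /* does the data type have a hash function? */
    bool        use_hash;        /* current strategy: hash table or linear search */
    bool        hash_built;      /* has the hash table been built yet? */

    /* runtime bookkeeping, used to revisit the decision later */
    uint64_t    linear_ticks;    /* time spent in linear-search checks */
    uint64_t    linear_checks;   /* number of linear-search checks */
    uint64_t    hash_ticks;      /* time spent in hash-lookup checks */
    uint64_t    hash_checks;     /* number of hash-lookup checks */
} AdaptiveSaopState;
```

Here `hash_capable` answers “can we hash at all”, while `use_hash` is the flag the interpreter would consult on every check.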
The `execExpr` (expression interpreter) would need to be updated to use the new flag, but that’s relatively straightforward.
The expression interpreter may seem a bit confusing at first, due to a layer of opcodes and JIT. But what matters are the two functions implementing the strategies:
- `EEOP_SCALARARRAYOP` -> `ExecEvalScalarArrayOp`
- `EEOP_HASHED_SCALARARRAYOP` -> `ExecEvalHashedScalarArrayOp`
I imagine these functions would do the membership check just like now, and also collect enough information to calculate more accurate costs for each strategy. And then maybe flip to the other strategy, if appropriate.
These changes should be relatively straightforward, I believe. The hard part is collecting sufficiently accurate cost data in a cheap way. This may require thinking out of the box, creativity and experimentation.
For example, it might seem natural to just call `gettimeofday` before and after each check and calculate how long it took. But `gettimeofday` can be damn expensive, and also not sufficiently accurate (microseconds at best, and even that is not guaranteed). It’s very hardware and OS dependent, so it’s hard to say. But it’s why `EXPLAIN ANALYZE` often takes much longer than just executing the query.
But there may be alternative ways to measure the cost - I suspect we might use RDTSC, or maybe HPET with sampling, or something else. We don’t need to know the time in “human time”, we just need to compare the costs of the strategies.
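Sampling in particular might keep the overhead manageable - only time roughly one in every N checks, so even a relatively expensive clock source gets amortized. A rough standalone sketch (the interval, the names and the function-pointer shape are all made up for illustration):

```c
#include <stdint.h>
#include <time.h>

#define SAMPLE_EVERY 1024          /* arbitrary sampling interval */

/* Time roughly one in every SAMPLE_EVERY membership checks and accumulate
 * the sampled durations, so the clock_gettime() overhead is paid rarely. */
static int
timed_check(int (*check) (const void *state, int value),
            const void *state, int value,
            uint64_t *nchecks, uint64_t *sampled_ns, uint64_t *sampled_checks)
{
    if ((*nchecks)++ % SAMPLE_EVERY != 0)
        return check(state, value);

    struct timespec t0, t1;
    int     found;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    found = check(state, value);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *sampled_ns += (uint64_t) ((t1.tv_sec - t0.tv_sec) * 1000000000LL
                               + (t1.tv_nsec - t0.tv_nsec));
    (*sampled_checks)++;

    return found;
}
```

The sampled average cost per check would then feed into the strategy decision described below.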
If I was hacking on this, I’d try to implement it along these lines (a sketch of the decision rule follows the list):

1. Stick to the decision made by `convert_saop_to_hashed_saop` during query planning for a certain number of rows (say 1000?). Measure the cost of the checks.
2. Then switch to the other strategy for a bit, to get a cost estimate for that too. It might be just a couple of calls, enough to get a reliable estimate. And then adjust the strategy based on that.
3. After adjusting the strategy, verify the decision from time to time. The cost may depend on the data - the linear search may be cheap at the beginning of the table and expensive at the end, or something like that. So rechecking from time to time seems like a good idea.
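A minimal sketch of what that periodic decision could look like, given the accumulated per-strategy costs - the sample threshold, the 20% hysteresis and all the names are made-up values for illustration, not anything from PostgreSQL:

```c
#include <stdint.h>

typedef enum SaopStrategy
{
    SAOP_LINEAR,
    SAOP_HASH
} SaopStrategy;

/* Pick a strategy based on the average cost per check observed so far.
 * If one strategy has too few samples, return it so we gather more data. */
static SaopStrategy
choose_strategy(SaopStrategy current,
                uint64_t linear_ticks, uint64_t linear_checks,
                uint64_t hash_ticks, uint64_t hash_checks)
{
    const uint64_t min_samples = 100;   /* arbitrary */

    /* not enough data about one of the strategies yet - go sample it */
    if (linear_checks < min_samples)
        return SAOP_LINEAR;
    if (hash_checks < min_samples)
        return SAOP_HASH;

    double  linear_avg = (double) linear_ticks / linear_checks;
    double  hash_avg = (double) hash_ticks / hash_checks;

    /* require a clear win (20% here) before switching, to avoid flapping */
    if (current == SAOP_LINEAR && hash_avg * 1.2 < linear_avg)
        return SAOP_HASH;
    if (current == SAOP_HASH && linear_avg * 1.2 < hash_avg)
        return SAOP_LINEAR;

    return current;
}
```

The hysteresis is there so the strategy doesn’t flap back and forth when the two costs are close - in that case it doesn’t matter much which one we use anyway.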
Risks
I suspect the main risk is (not) finding a good way to measure the cost of the different strategies. Those checks may be pretty short (very few instructions), so measuring needs to be very cheap but also accurate. I’m not sure it’s possible to have both.
On the other hand, if the checks get too fast to measure accurately, perhaps that means adjusting the strategy won’t make a difference anyway.
Another risk is platform support. I’ve suggested using `RDTSC` earlier, but that’s an `x86` instruction, and that’s not the only platform we support. It would be good to find a good solution e.g. for `arm64` (a quick search suggests `CNTVCT_EL0` might be useful), or at least an acceptable generic fallback.
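For illustration, a portable “ticks” read might look roughly like this - `__builtin_ia32_rdtsc()` is the GCC/Clang builtin for RDTSC, the `mrs` read of `CNTVCT_EL0` covers arm64, and `clock_gettime()` serves as a generic fallback. This is just a sketch, and the units differ across platforms, but that’s fine as long as we only compare the two strategies against each other:

```c
#include <stdint.h>
#include <time.h>

/* Read a cheap, monotonically increasing tick counter. The unit differs
 * per platform (TSC cycles, timer counter ticks, nanoseconds), which is
 * acceptable when we only compare the two strategies to each other. */
static inline uint64_t
read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __builtin_ia32_rdtsc();                  /* x86 TSC */
#elif defined(__aarch64__)
    uint64_t    val;

    __asm__ volatile("mrs %0, cntvct_el0" : "=r" (val));   /* arm64 virtual counter */
    return val;
#else
    struct timespec ts;                             /* generic fallback */

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;
#endif
}
```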
Conclusions
So that’s it. It’s not a huge patch, but it might be an interesting robustness improvement. As always, feel free to reach out to me directly by email. Or you can talk to other Postgres developers in pgsql-hackers, or the new discord channel.