So why don't we pick the optimal query plan?
Last week I posted about how we often don’t pick the optimal plan. I got asked about difficulties when trying to reproduce my results, so I’ll address that first (I forgot to mention a couple of details). I also got questions about how to best spot this issue, and about ways to mitigate it. I’ll discuss that too, although I don’t have any great solutions; I’ll briefly go through a couple of possible planner/executor improvements that might allow handling this better.
Reproducing the results
I shared the SQL and shell script I used to generate the data sets and queries, but I forgot to mention a couple important details. That unfortunately affected reproducibility.
SQL script (data generator)
The SQL script requires two parameters:
ROWS - number of rows to insert into the tables
CYCLES - number of “cycles” in the cyclic data sets
For my tests I used 10M rows and 100 cycles, which made the tables about 4.3GB each. Something like this should give you about the same database.
$ sed 's/ROWS/10000000/g' create.sql | sed 's/CYCLES/100/g' | psql testdb
The script generates random data, but the size and data distribution / patterns should be the same.
bash script (query generator)
The bash script expects three positional parameters:
dbname - the DB to run against (with all data already pre-generated using the SQL script)
datadir - Postgres data directory (so that it can be restarted)
timeout - timeout for queries (this is just the usual statement timeout in milliseconds)
So run it like this:
$ ./run-test.sh testdb /mnt/pg/data-directory 60000
You might need to adjust the timeout a bit, depending on the type of storage you use. The goal is to allow most queries to complete, and only abort some extreme cases.
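For reference, the timeout is just the regular statement_timeout GUC, so inside the script it presumably boils down to something like this (the exact mechanism may differ):

-- Cancel any query running longer than 60 seconds (the value is in milliseconds).
SET statement_timeout = 60000;
-- Queries exceeding the limit fail with:
--   ERROR:  canceling statement due to statement timeout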
Postgres configuration
The last piece is the configuration of the instance. I’ve done only the most basic tuning, in particular:
shared_buffers = 32GB
random_page_cost = 2
The shared_buffers value affects the execution a little bit, but it’s not considered during planning. The random_page_cost parameter is very important, and it affects planning in an interesting way. I’ll discuss it in detail in a bit.
All other parameters were left at default values.
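If you’d rather not edit postgresql.conf by hand, something like this should produce the same configuration (note that shared_buffers only takes effect after a restart, while random_page_cost only needs a reload):

-- Persist the two settings (written to postgresql.auto.conf).
ALTER SYSTEM SET shared_buffers = '32GB';    -- requires a restart
ALTER SYSTEM SET random_page_cost = 2;       -- requires only a reload

-- Reload the configuration for parameters that don't need a restart.
SELECT pg_reload_conf();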
With this you should be able to reproduce the same behavior I described in my post. You’ll also need a trivial drop-caches.sh script, which simply evicts everything from the page cache to get “cold” query durations.
Of course, the exact behavior depends on the hardware, but if you’re running on some sort of flash storage (ideally local SSDs), you should see the same behavior I presented.
random_page_cost
It took me a while to figure out why several people struggled to reproduce the poor plan choices I saw. Then I realized they’re running with the default parameters, especially random_page_cost = 4. Once they changed it to the lower value 2, they got the same “Index Scan” plans. Why is that?
Consider the following chart, with estimated costs of the three basic types of scans for random_page_cost values 1.25, 2 and 4. This is for the uniform data set, about the simplest type of data distribution. Use PNG or PDF for higher-resolution versions of the chart.
The random_page_cost value has no impact on the sequential scan (blue dots), and bitmap scans seem almost unaffected too (the cost does change, but not much).
The cost of index scans changes a lot. That makes perfect sense, because index scans do a lot of random I/O, and that is the dominant part of their cost. The lower the random_page_cost, the cheaper the index scan. It’s clearly visible at selectivities between 10% and 100%.
The total cost is roughly proportional to the random_page_cost value. Going from 4 to 2 cuts the cost to about 50%, and going to 1.25 cuts it to about 30%. That holds at least for selectivities of 10-100%; the effect is less visible close to 0%.
But the planner does not care about the cost of individual plans in isolation; it cares about how each plan compares to the others. And these charts show that as the random_page_cost value decreases, index scans start to win (over bitmap scans) more often. The dark points show which scan is picked by the optimizer. The number of red ones increases, which means the planner picked the index scan as the cheapest one.
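You can watch this flip happen for a single query by changing the parameter within a session and re-planning. A rough sketch; the table and predicate are placeholders, substitute one of the generated tables and a condition matching a few percent of rows:

-- Placeholder query - use one of the generated tables and a predicate
-- matching a few percent of the rows.
SET random_page_cost = 4;
EXPLAIN SELECT * FROM test_table WHERE some_column < 500000;
-- likely a bitmap scan (or seqscan) at this selectivity

SET random_page_cost = 1.25;
EXPLAIN SELECT * FROM test_table WHERE some_column < 500000;
-- the same query may now switch to a plain Index Scan with a lower
-- estimated cost, even though the actual execution can be much slower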
cold data
Here are charts of duration for “cold runs” (with no data cached) for the same random_page_cost values (PNG or PDF).
The shapes do not change at all for different random_page_cost values, which makes sense - the parameter only matters for planning. Once a plan is picked, the value becomes irrelevant. The duration of a plan does not depend on how it was selected.
But the shift in picking index scans as the “optimal” plan more often still applies. The lower the random_page_cost, the higher the selectivity at which the planner still thinks the index scan is the cheapest plan. For random_page_cost = 4.0, the index scan wins only once, at ~1%. For random_page_cost = 1.25 it wins everywhere between 1% and 10%. And in all those cases it’s much slower than the bitmap scan (and even than the seqscan).
warm data
The above chart is for durations on cold data, i.e. with both shared buffers and page cache empty. If the data is cached, you get this chart instead (hi-res PNG or PDF).
We see the same shift between plans, but without the I/O the differences are mostly negligible. It’s nice and smooth and does not get much slower than the “dumb” sequential scan.
Is “random_page_cost = 2.0” wrong?
So did I make a tuning mistake when I changed the random_page_cost? I don’t really think so. The default value (4) is decades old at this point, and it was set back when everyone used “old” rotational disks. While there’s still a difference between random and sequential I/O even on modern storage (like NVMe flash), it’s getting smaller. Based on this, it makes sense to reduce the value.
I’m aware of recommendations to set random_page_cost much closer to 1.0, and even the official documentation mentions potentially setting it to 1.1 for SSDs:
Storage that has a low random read cost relative to sequential, e.g., solid-state drives, might also be better modeled with a lower value for random_page_cost, e.g., 1.1.
That might be too aggressive. Even with NVMe SSDs, random I/O remains more expensive than sequential I/O. So unless you assume everything is cached (like in the “warm” case above), I’d stick to higher values. A lower value might be justified after some careful testing on your data set and workload, but without that, 2.0 seems like a reasonable compromise.
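One practical detail for such testing: random_page_cost can be overridden per session (for experiments) and per tablespace (to limit a considered change to, say, SSD-backed storage). A sketch, with a hypothetical tablespace name:

-- Experiment in a session first, comparing plans and actual durations
-- of your own representative queries at different values.
SET random_page_cost = 1.1;

-- If the lower value consistently wins, apply it just to the SSD-backed
-- tablespace instead of the whole cluster ("fast_ssd" is a hypothetical
-- tablespace name).
ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1);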
The estimated cost might also be off because some other estimate is significantly off. For index scans the cost is calculated by cost_index(), so maybe it mis-estimates some input parameter? Say, the number of heap pages to visit? I did check that for the presented examples, and I don’t think that’s happening. Which is not surprising: a uniform distribution is the default assumption of all the formulas, and it’s fairly easy to estimate. It’s the case where the estimates tend to be most accurate.
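If you want to do a similar check on your own queries, comparing the planner’s expectations against reality is mostly a matter of reading EXPLAIN output. A sketch (the query is again a placeholder):

-- Compare estimated vs. actual row counts, and look at the buffer counts
-- to see how many pages the scan really touched.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM test_table WHERE some_column < 500000;

-- In the output, compare "rows=" in the estimates with the actual row
-- counts, and check "Buffers: shared hit=... read=..." for the amount
-- of I/O the scan caused.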
How to identify the issue?
So how do you tell whether you’re affected by this behavior? I’m not sure.
If you’re executing queries on the “threshold” between the fast and slow plans, you might notice there are two modes. This is the case discussed in my “performance cliffs” talks. That’s easy to notice.
But if your queries consistently fall into the “slow” region, you’ll only see the slow executions. How would you know a faster plan is possible? You’d have to guess there’s a faster plan, and then experiment with enable_ options until you convince the planner to pick it.
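Such an experiment usually looks something like this (the query is a placeholder): disable the plan type the planner insists on, and compare the actual durations.

-- What does the planner pick by default?
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_table WHERE some_column < 500000;

-- Heavily penalize index scans for this session only, forcing the planner
-- to consider the alternatives, and compare the actual execution time.
SET enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_table WHERE some_column < 500000;

RESET enable_indexscan;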
Such experiments are doable only for a small number of relatively simple queries. As queries get more complex and/or the number of queries increases, it gets hard to explore them. You may focus on the expensive queries, but you still won’t know if a better plan exists.
How could we fix / improve this?
I hesitate to call this behavior a bug, because it works as designed. I can’t rule out the possibility of a bug, of course. Feel free to investigate, I might have missed something.
Is there something we could do about this in the database itself? I have a couple ideas …
Should we adjust the random_page_cost default?
In the presented examples the fundamental problem seems to be the costing of index scans. Maybe it’s all due to random_page_cost simply being too low? Maybe it’s time to re-evaluate what the default should be?
Or maybe we should have some automated way to tune the parameter for specific hardware?
Every now and then there’s a suggestion to have a tool similar to pg_test_fsync that’d do some I/O tests and compute the “correct” random_page_cost value. Maybe we should try implementing such a tool, but there’s a catch. It probably can’t be just a “simple” hardware test, as the value is not a simple ratio of costs for random vs. sequential I/O.
AFAIK the current default (4.0) was picked based on a benchmark that ran queries with different scan types, and 4.0 happened to switch plans at the “right places.” I’m sure it’s described in some old thread on pgsql-hackers, but I can’t find it at the moment.
So it’s a value selected so that the cost model makes the right decision, which is a little bit backwards.
If implementing this in a tool turns out to be too difficult, perhaps we should at least redo the benchmark and adjust the random_page_cost default (or the tuning advice) accordingly?
The expectation is that we’d end up with a lower random_page_cost value, because the hardware got much better at handling random I/O. But I’m not entirely sure of that. We’ve improved all parts of Postgres, and some of that likely made sequential scans cheaper. The OS also improved the relevant bits (like readahead etc.). I wouldn’t be surprised if we actually ended up with a higher random_page_cost default.
Make the cost model aware of prefetching?
Another thing that occurred to me is this may be due to the optimizer not considering prefetching. Bitmap scans prefetch heap pages, while index scans do not. Yet the optimizer is oblivious to this.
This may not have been that obvious with effective_io_concurrency = 1, but PG 18 increases the default to 16, and that likely helps bitmap scans quite a bit. Maybe it’s time to account for this when costing bitmap scans?
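For context, effective_io_concurrency is a regular GUC, so it’s easy to check what your instance uses and to experiment with other values; as discussed above, it affects the execution of bitmap heap scans but not their estimated cost.

-- What is the current value?
SHOW effective_io_concurrency;

-- Try a higher value for the session and re-run a bitmap-scan-heavy query;
-- the estimated cost won't change, but the "cold" duration may.
SET effective_io_concurrency = 64;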
We could also do the opposite thing, add prefetching to index scans. We need to do that for other reasons (e.g. to support direct I/O), and I’ve been working on this for quite a while. Here’s a chart from one of the tests including index scan with prefetch:
Prefetching would put index scans on par with bitmap scans, at least for the “medium selectivity” up to ~10%. Above this threshold the bitmap scans benefit from only visiting each page once (and in a sequential pattern), while index scans have to visit it repeatedly.
But even so, when scanning 100% of the table, prefetching makes the index scans 2x faster.
Summary
Ultimately, there’s only so much we can do in the optimizer. It makes decisions based on a very simplified approximation of the data set, and so it’s bound to make mistakes. And we’re not making its task any simpler by adding more and more options to the executor. Every option to run a query differently means an opportunity to make a mistake.
I don’t think it’s possible to fix that by making the planner smarter. We should keep improving the planner, of course. And there are interesting ideas like better stats, adaptive optimization, or using feedback from past executions. We should explore that.
But such a priori decisions simply can’t be 100% correct. A “perfect” planner that never makes mistakes (with arbitrary queries on arbitrary data) is an unachievable goal.
We need to improve the executor to be more resilient against such mistakes, and to handle them gracefully. There’s a whole research discipline focused on the robustness of query execution.
Note: As a starting point I highly recommend the series of Dagstuhl Seminars on robust query processing in various contexts (the link is for 2024, scroll down to “related seminars”).