Some more thoughts on random_page_cost

A couple months back I posted about maybe adjusting random_page_cost to better reflect how current storage handles random and sequential access. I had a bunch of great discussions about the topic since then, but ultimately I got distracted by other stuff.

POSETTE happened last week, with my pre-recorded talk about this very topic (and many other great talks, BTW). Which reminded me that I started thinking about random_page_cost a bit differently. So here’s an update with some more thoughts.

I did already touch on some of these things in the old post, and then also in the POSETTE talk in more detail.

Before I get to that, let me share a chart with random_page_cost results from rotational SATA drives. I completely forgot I have these disks in the machine until I opened the case to do some maintenance. Those are likely much closer to the storage used for the original experiments in ~2000. Those would likely be PATA or SCSI drives, but still rotational. Maybe that will give us values closer to the 4.0 default?

Clearly not. In fact, the estimated random_page_cost is ~125, about 2-4x the estimate for SSD storage. So with SSDs it’s getting closer to the default, but that’s just a coincidence.

Perhaps there’s some fundamental piece of the old experiment that we failed to recall? Or maybe the “raw” results were adjusted in some way. But it seems the 4.0 default never was the “raw” cost of random I/O.

I got a lot of feedback from people who tried increasing random_page_cost in the past. In their experience it definitely did not improve the performance, it hurt it. How is that possible, if it makes the costing less accurate?

I believe it comes to random_page_cost “compensating” for the cost model being incomplete. It’s not accounting for various caching effects and resources related to plans performing a lot of random I/O.

Every cost model is an approximation, and a relatively crude one. It’s not possible to have a fast/cheap cost model that accurately mimics every tiny detail. You may make it more and more detailed, but at some point it’d become as large as the original system. And then why have a model? It’d be pretty useless.

Our cost model has a couple gaps that I think matter here.

Cost model (mostly) ignores memory

The cost of an operation is calculated from the amount of CPU and I/O used to execute the operation. Those are two key resources, but it ignores memory

another important resource.

We have work_mem, but that’s more of a safety limit, as it limits the size of work buffers (e.g. for sorting or hashing). And it may affect the amount of I/O an operation needs to do (e.g. smaller buffers in hash join means more spilling to disk).

But it does not track other memory “used” by a plan, it ignores other uses of memory.

Consider a 100GB table, containing 1GB of “interesting” data (matching our query predicate). We can scan the table sequentially or through an index. If the table is “cold”, a sequential scan may push ~100GB of other data from memory (shared buffers or page cache). The query may run just fine with work_mem=4MB, but it may effectively “use” 100GB of memory.

On the other hand, an index may only need to access ~1% of the table (the 1GB of interesting data). The random I/O may easily take more time than the sequential scan, but on the other hand it’ll only “use” 1GB of memory. It may evict much less other data from cache.

Yet the cost model is completely oblivious to this.

Locality of access

Another reason is the concept of an active set. You may have a 1TB database, but in most practical systems only a tiny subset of the data is accessed. Users only look at recent orders, etc. This is what we call the “active set” of the database.

The key is to keep the active set in memory. And plans doing a lot of random I/O tend to be more localized, i.e. accessing only the “interesting” data. An index scan will typically access a much smaller fraction of data than a sequential scan. The I/O may be more expensive (random, visiting pages multiple times), but it’s just the “interesting” data.

Sequential scans may massively expand the active set - possibly to the whole database. Which is not great, unless you actually have enough RAM to fit the whole database. Index scans allow for a much smaller active set.

But the planner is completely unaware of the active set. It plans as if all queries start with cold data. It considers caching effects only in very limited cases - e.g. caching within the same query.

The planner is oblivious to various other things. It plans queries in isolation, as if there were no other queries competing for the same resources and the bandwidth was infinite. A sequential scan may be great for a single backend, but when there are 100 backends all doing seqscans, you’ll hit storage bandwidth.

Conclusions

I now see random_page_cost as a proxy for all these effects.

The planner is not costing memory the same way as CPU/disk, nor does it understand caching effects. Lower random_page_cost values push it toward plans that are more localized and keep the active set under control. Which in the long run seems like the winning strategy.

This means there can’t be a tool that’d tell you the “right” value for random_page_cost after profiling your storage and measuring how it handles sequential and random I/O. I’d love to have that, but if random_page_cost is not a raw I/O cost, that would not work.

Tuning random_page_cost has to be system-specific, driven by feedback from monitoring. For example, you may look at the top queries in pg_stat_statements, and see if those would benefit from a lower random_page_cost value. Then try to adjust the value, and monitor if it has the desired impact (or if it hurts other queries).

Do you have feedback on this post? Please reach out by e-mail to tomas@vondra.me.

Tomas Vondra

Some more thoughts on random_page_cost

Cost model (mostly) ignores memory

Locality of access

Conclusions