Tomas Vondra

blog about Postgres code and community

[PATCH IDEA] Statistics for the file descriptor cache

Let me present another “first patch” idea, this time related to runtime statistics about access to the files storing data. Having this kind of information would be very valuable on instances with many files (which can happen for many reasons).

This is a very different area than the previous patch idea, which was about an extension. The runtime stats are at the core of the system, and so is the interaction with the file system. But it’s still fairly isolated, and thus suitable for new contributors.

If you didn’t read the first post about how to pick your first patch, maybe do so now. Everything I wrote in that post still applies - you should pick a patch that’s interesting (and useful) to you personally. Don’t jump on this idea simply because I posted about it.

If you decide to give this patch idea a try, let me know. Maybe not right away, but once you get past the initial experiments. Otherwise multiple people might end up working on it, and everyone except the first person to submit to pgsql-hackers would have to throw their work away. They’d still gain insight and the ability to review the submitted patch, so it wouldn’t be a total waste of time. But if your goal was to write your first patch …

PostgreSQL has a lot of monitoring statistics about the database. This is useful for monitoring and investigating all sorts of operational issues. The docs have a very nice overview of the [cumulative statistics system](https://www.postgresql.org/docs/current/monitoring-stats.html), listing all the system views and the available information. There’s data about tables/indexes, currently running queries, replication, SSL, WAL, … Most of the important subsystems provide some stats.

Note: Don’t confuse this with statistics used by the optimizer when planning queries. That’s a different thing. This patch idea is about stats used for monitoring - what the database is doing, not the data it’s processing.

Motivation

An important part of the database is accessing data stored in files on disk. Postgres stores data for individual objects separately - each table or index gets its own “file” in the database directory. And not only that - the files are also split into 1GB chunks (called “segments”).

For example, this is a tiny part of the list of files in the directory for a database with OID 16478:

xeon:/mnt/data/pgdata/data/base/16478# ls -l
...
-rw------- 1 postgres postgres 1073741824 Aug 23 19:35 16491
-rw------- 1 postgres postgres 1073741824 Aug 23 19:35 16491.1
-rw------- 1 postgres postgres 1073741824 Aug 23 19:35 16491.10
-rw------- 1 postgres postgres 1073741824 Aug 23 19:35 16491.11
-rw------- 1 postgres postgres 1073741824 Aug 23 19:35 16491.12
-rw------- 1 postgres postgres 1073741824 Aug 23 19:35 16491.13
-rw------- 1 postgres postgres 1073741824 Aug 23 19:35 16491.14
...

There are many more files in that directory, but all the files shown above store data for a single object. I’m not sure if the object is a table or an index, but that doesn’t matter too much. The documentation has many more details on the database file layout.

There are reasons why it was done this way. Some older filesystems had limits on file size, and it may be difficult to deal with very large files for other reasons. But this post is not about that.

The consequence, however, is that databases may have a lot of files. A 1TB table needs 1000 segments, and the indexes need their own segments too. There may be much larger tables, perhaps even several of them, and so on. Or maybe you just have many small tables.

Partitioning is becoming more common, and it’s not uncommon to see partitioned tables with hundreds or even thousands of partitions. Those partitions may be small (even smaller than 1GB), but they are separate objects and thus get separate files.

So the number of files in a large instance can be quite high, and that comes with some overhead when accessing those files. In particular, it’s not free to open (or close) a file.

To deal with that, Postgres caches the opened file descriptors in each backend. The number of files kept open is defined by the max_files_per_process configuration parameter:

max_files_per_process = 1000

This is the default value, and it means each backend will keep up to 1000 files open. That mostly works fine, but if your queries need to access more files than that, it may not be enough.

Note: The size of the cache is not just up to Postgres, it depends on limits set by the operating system too. Check ulimit -n to make sure the max_files_per_process value can actually take effect.

Consider, for example, a query doing COUNT(*) on a table with 1000 small partitions. Those file descriptors won’t fit into the file descriptor cache, and will “thrash” the cache on each execution. The entries would never be reused before eviction.

Opening/closing files is not extremely expensive, but it’s not free either. It requires a syscall, and filesystems have other important things to optimize for too. The amount of overhead depends on the filesystem, of course.
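To get a rough sense of that cost, here’s a tiny stand-alone C program (not Postgres code, purely an illustration - the file name and loop count are arbitrary) measuring the average cost of an open()/close() pair:

/* Measure the average cost of an open()/close() syscall pair.
 * Absolute numbers depend heavily on the filesystem and kernel. */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
    const int       loops = 100000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < loops; i++)
    {
        int fd = open("/etc/hostname", O_RDONLY);   /* any existing file */
        if (fd >= 0)
            close(fd);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                      + (end.tv_nsec - start.tv_nsec);
    printf("avg open+close: %.0f ns\n", elapsed_ns / loops);
    return 0;
}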

The easiest solution in Postgres is to make the cache larger, to prevent the thrashing. Results from a simple benchmark with max_files_per_process set to 1000 vs. 32768 on a big machine (in this case an AMD EPYC with ~96/192 cores) may look like this:

[figure: benchmark for max_files_per_process 1000 vs. 32768]

Yup, that’s a ~4-5x improvement in throughput. Not bad!

Of course, this test is quite simplistic, and it ran on a development build addressing a couple of other scalability bottlenecks. Your queries likely do more expensive stuff than just COUNT(*), which would make the improvement more modest. But the message is clear - the size of the cache matters.

Which (finally) gets me to the motivation - tuning the file descriptor cache size requires information about the number of hits/misses. And none of that information is available in Postgres. To get some of this data you have to profile the backends using perf, strace, etc.

And that’s what this patch idea is about - making this data available in the pg_stat_* system views, so that people can monitor the cache hit ratio (i.e. hits / (hits + misses)) and adjust the configuration parameter if needed.

There’s probably much more information that could be provided about the file descriptor cache; I haven’t thought about it too much.

Implementation

I think the implementation should not be too difficult, and it needs to touch roughly three places:

  • storing the statistics in shared memory
  • updating the counters in the places that open/close files
  • defining system views to make the information accessible

First, let’s talk about where the statistics would be stored. That’s an already solved problem - Postgres has a pgstat module doing exactly that for other monitoring statistics. To add the new information, you’d need to add a new struct with the new counters to pgstat.h, along with an API to update those counters from the relevant places.
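A minimal sketch of what such an addition to pgstat.h might look like - all the names here are hypothetical, I simply mirrored existing structs like PgStat_BgWriterStats; PgStat_Counter and TimestampTz come from the existing headers:

/* Hypothetical addition to pgstat.h -- names are illustrative only. */
typedef struct PgStat_VfdCacheStats
{
    PgStat_Counter hits;        /* descriptor found open in the cache */
    PgStat_Counter misses;      /* file had to be (re)opened */
    TimestampTz    stat_reset_timestamp;
} PgStat_VfdCacheStats;

/* API to bump the counters, called from fd.c (or md.c) */
extern void pgstat_count_vfdcache_hit(void);
extern void pgstat_count_vfdcache_miss(void);

/* API to read the accumulated stats, used by the SQL-level functions */
extern PgStat_VfdCacheStats *pgstat_fetch_stat_vfdcache(void);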

Then you’ll need to make sure the counters are updated from the places that deal with the file descriptor cache. I think this should happen either in fd.c (functions like PathNameOpenFile), or maybe higher up in md.c (functions like _mdfd_getseg). I’m not 100% sure what the correct layer to do this is.
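For illustration, here is roughly how the counting might look if done in fd.c’s FileAccess(), which checks whether a virtual file descriptor is still open and re-opens it if not. This is a simplified sketch based on my reading of fd.c, not a verified patch, and the pgstat_count_vfdcache_*() functions are the hypothetical ones from the previous snippet:

/* Simplified sketch of fd.c's FileAccess() with the new counters added.
 * The real function also does LRU bookkeeping and error handling. */
static int
FileAccess(File file)
{
    if (FileIsNotOpen(file))
    {
        /* descriptor was closed earlier (evicted or never opened) */
        pgstat_count_vfdcache_miss();
        return LruInsert(file);     /* re-opens the file, may evict another */
    }

    /* the descriptor is still open -- a cache hit */
    pgstat_count_vfdcache_hit();
    return 0;
}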

Finally, you’ll need to add the system views to make the collected information accessible from SQL. This happens by defining the view in system_views.sql, and implementing a couple of C functions returning the data.
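The C side could be a simple function per counter, similar to how the pg_stat_bgwriter view exposes its numbers. Again a hypothetical sketch - the function name is made up, and it would also need a matching entry in pg_proc.dat:

/* Hypothetical SQL-callable wrapper (e.g. in pgstatfuncs.c),
 * exposing one of the new counters. */
Datum
pg_stat_get_vfdcache_hits(PG_FUNCTION_ARGS)
{
    PG_RETURN_INT64(pgstat_fetch_stat_vfdcache()->hits);
}

The view in system_views.sql would then simply select from these functions.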

Obviously, you don’t have to invent how to do these things - we already do the same for other data. What I’d do is pick similar data, also collected at the instance level, and copy and modify the relevant bits in the pgstat module. For example, the bgwriter stats seem like a good candidate: find everything in pgstat.h that has “bgwriter” in the name, make a copy with “vfdcache” (or something like that) in the name instead, and tweak it as needed. Same for the pgstat.c and system_views.sql files.

Once you have that, the main question will be where to call the _hit() and _miss() functions (added to pgstat) from - the fd.c sketch above shows one possible answer.

Risks

I don’t think there’s a risk the implementation would be impossible or even particularly complex - we collect similar stuff elsewhere, so there’s no reason why it wouldn’t work here.

There are other risks, though. It’s possible someone might argue this information is not as useful as I think, and that the feature is not worth the extra complexity. But I don’t think that will happen.

What might happen is a discussion about what other information we should collect, and at what granularity. I mentioned only the information necessary for calculating the cache hit ratio at the instance level. But maybe we should collect this for individual backends (or even for queries, in pg_stat_statements), and so on.

I don’t think that’s a problem. Such discussions happen with most patches, and maybe the patch will indeed need such improvements. But I wouldn’t worry about that at the beginning - write a simple working patch first; it can be improved later.

Conclusions

So that’s the second patch idea. If needed, you can reach out to me directly by email. There are also a number of places where you can talk to a bigger group of Postgres developers - for example the pgsql-hackers mailing list, or the new Discord channel. There are many similar channels on various other social platforms.

Do you have feedback on this post? Please reach out by e-mail to tomas@vondra.me.