The state of the Postgres community
About a month ago I presented a keynote at Swiss PGDay 2024 about the state of the Postgres community. My talk included a couple charts illustrating the evolution and current state of various parts of the community - what works fine and what challenges will require more attention.
Judging by the feedback, those charts are interesting and reveal things that are surprising or at least not entirely expected. So let me share them, with a bit of additional commentary.
I do speak at Postgres conferences fairly regularly, and my talks tend to be about the technical stuff - features, performance, … But when I was invited to do a keynote at Swiss PGDay 2024 a couple months back, I knew I didn’t want to do one of my “usual” talks. Keynotes should be a little bit special and different. Otherwise why have a special keynote slot at all, right? So after thinking about it for a couple days, I decided to do a keynote about the evolution and current state of our community.
I have not attended the Swiss PGDay conference before. I’ve heard from multiple people they really liked it, both as attendees and speakers. But it never worked for me time-wise - it’s in a busy part of the year, right after other events that I need to attend, and right before the summer holidays, …
But giving a keynote was a good excuse to go anyway this year, and I was not disappointed. It really is a very nice, well-organized event, and I really enjoyed meeting all the people. I 100% agree with what Laurenz Albe wrote in his post about the event (highly recommend reading it). And it’s an amazing location - next to a lake, with a view of the Alps on the other bank.
So if you’re in Switzerland (or somewhere close), I certainly recommend attending the conference next year. OK, if you’re in Switzerland you’re probably used to the gorgeous environment with mountains and lakes, but go to the conference anyway.
Back to the original topic …
In the keynote I (very) briefly went through the history of Postgres, and how I think that affected the structure of the community. And then I presented a handful of charts with some simple metrics that I believe illustrate where we’re at - commits, committers, mailing lists, etc.
I certainly don’t claim this is a comprehensive description of the whole community. After the keynote I realized it might seem that when I speak about the “community” I only include those working on the core C code. That was not my intention.
The community is huge, and there’s no chance to cover it in ~45 minutes reserved for the keynote. And I’m not familiar with every part of the community to do a good analysis, so I focused on the parts that I’m most familiar with.
If you want to check the whole talk, the slides are available here. It’s hard to understand slides without the additional commentary, though (this applies to all slides, not just this keynote).
Now, let’s finally look at the charts.
Commits per year
First, let’s look at the number of commits per year, as calculated by git itself, using a command like:
$ git log --pretty='format:%cd' --date=format:'%Y' \
| sort | uniq -c | awk '{print $2 " " $1}'
This assumes commits are reasonably similar over time on average, i.e. about the same size/complexity, etc. I did check some other stats provided by git (number of files changed, number of lines inserted and deleted). The charts looked almost exactly the same, so I believe the assumption is reasonable.
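The check against those other stats can be sketched in the same style as the command above. This is an assumption about how one might do it, not the exact commands used for the talk: the helper name churn_per_year is made up for illustration, and it expects the output of git log --shortstat with the same year-only date format.

```shell
# Sum files changed, lines inserted and lines deleted per year.
# Reads the output of the git log command from the usage note on stdin
# and prints "year files insertions deletions" lines.
churn_per_year() {
    awk '
        # Date line printed by --pretty (just the year).
        /^[0-9][0-9][0-9][0-9]$/ { year = $0; next }
        # Summary line printed by --shortstat.
        / changed/ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^file/)      files[year] += $(i - 1)
                if ($i ~ /^insertion/) ins[year]   += $(i - 1)
                if ($i ~ /^deletion/)  del[year]   += $(i - 1)
            }
        }
        END { for (y in files) print y, files[y], ins[y] + 0, del[y] + 0 }' \
    | LC_ALL=C sort
}

# Usage, inside a clone of the repository:
#   git log --shortstat --pretty='format:%cd' --date=format:'%Y' \
#     | churn_per_year
```

Plotting any of the three columns next to the plain commit counts is one way to see whether the curves have the same shape.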
So, what does the chart say? My interpretation is that the pace of the development activity is remarkably stable. We did have a peak in 2001, with almost 3000 commits, followed by gradual slowdown to ~2000 commits per year. And we have maintained that pace for the last ~10 years.
There were two “bumps” in 2009 and 2013. I don’t have a good explanation for those, supported by data (I haven’t looked very long, though). But if I had to guess, I’d say the 2009 bump may be related to 8.4 slipping by a couple months, and people taking some time off afterwards to recover.
I’m not sure about 2013, as the 9.3 (or 9.4) releases did not slip. But I note that we reworked the commitfest app in 2014, so perhaps there were some issues with the old one, slowing down the activity? Not sure.
Active committers
Another important measure of development activity is the number of people participating. For example, how many people committed stuff to the git repository over time?
If you count “active committers” (which I defined as those who committed something in a given month, relying on the git “committer” field), you’ll get something like this:
The first observation is that the number of active committers is smaller than people sometimes assume. The fact is there are about 31 committers as of now, so getting ~23 of them active per month is reasonable. That’s not a lot, considering the popularity of Postgres.
This also shows that the number grows - in 1996 when the open source project started, we had maybe 2-3 active committers, now we have 23. And the growth is fairly steady, especially since ~2005, which is nice.
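The “active committers” metric defined above can be reproduced roughly from git log alone. The following is a minimal sketch under assumptions, not necessarily how the chart was produced: it prints one “month committer” pair per commit, deduplicates the pairs, and counts what is left per month. The function name is hypothetical.

```shell
# Count distinct committers per month.
# Reads "YYYY-MM committer name" lines on stdin (one line per commit)
# and prints "YYYY-MM count" lines.
active_committers_per_month() {
    LC_ALL=C sort -u \
    | awk '{print $1}' \
    | uniq -c \
    | awk '{print $2 " " $1}'
}

# Usage, inside a clone of the repository:
#   git log --pretty='format:%cd %cn' --date=format:'%Y-%m' \
#     | active_committers_per_month
```

Note this counts distinct committer names, so a committer who changed the spelling of their name over the years would be counted twice in a month where both spellings appear.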
Would it be better to have more committers? Absolutely! Some of the challenges the community is facing are due to limited capacity for patch review etc. And that’s related to the number of (experienced) contributors, which is where committers come from.
Is the number of active committers proportional to the number of experienced contributors? I don’t know, but if the answer is yes, would that mean that by growing the number of contributors we end up with more committers?
Also, remember the number of commits per year is pretty stable, at ~2000 commits / year. With the number of active committers growing, it might seem individual committers are doing less stuff on average.
Which might happen for a number of reasons - for example the patches might be more complex (so requiring more work). Or maybe the committers are spending more time working on tasks from their employer (and thus have less time to commit stuff to Postgres).
But it might also be a bit of an illusion - the “average committer” may not exist. There are a couple very active committers responsible for a significant fraction of commits. And it’s true this fraction is getting smaller over time (which I think is healthy and good for the project). That however means the gap is filled by other committers (to maintain the 2000 commits per year figure). Which is the opposite of the “amount of work drops on average” hypothesis.
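One way to check how much of the work falls on the most active committers is to compute their share of commits for a given period. The sketch below is only an illustration under assumptions (the function name and the choice of “top N” are mine): it takes one committer name per commit and prints the percentage of commits made by the N most active ones.

```shell
# Print the share (in percent) of commits made by the N most active committers.
# Reads one committer name per line on stdin (one line per commit).
# $1 = N, the number of top committers to include.
top_committer_share() {
    LC_ALL=C sort | uniq -c | sort -k1,1nr \
    | awk -v n="$1" '
        { total += $1; if (NR <= n) top += $1 }
        END { if (total > 0) printf "%.1f\n", 100 * top / total }'
}

# Usage, inside a clone of the repository (share of the top 3 in 2023):
#   git log --since=2023-01-01 --until=2024-01-01 --pretty='format:%cn' \
#     | top_committer_share 3
```

Running it for several years with the same N shows whether that fraction is shrinking over time.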
If you are interested in more statistics about the development, I highly recommend visiting Robert’s blog. For the last couple years he has published “Who Contributed to PostgreSQL Development” summaries, with per-committer and per-contributor metrics: (2016, 2017, 2018, 2019, 2020 and 2021, 2022, 2023).
Active contributors
I don’t think counting active committers is sufficient to assess the development activity. Committers are an important but fairly small part of the Postgres community, and handle just the very last step in the life of a patch.
I’m not sure what the best way to count contributors is. We are very careful to mention all contributors in each commit message, but there’s no uniform, structured way to do that. Some committers use a structured format, but many just mention people in the commit message directly. I don’t know how to extract that in an automated way.
Since PostgreSQL 10, release notes include a list of contributors - patch authors, committers, reviewers, testers, people who reported issues. In the 7 releases since then, there are about 320-420 names in each release, which means there’s at least 15x more contributors than committers.
This however still omits the people who didn’t make it to the release notes for some reason. They might not cross the threshold to be included in the list. Or maybe they proposed a patch that did not get committed (and despite that it can be very helpful).
If you have an idea how to improve this, and properly recognize a wider group of contributors, I’d love to hear it.
Patches by status
People participating in the development are not the only thing we can count. We can also count the patches, of course. At the beginning, the development relied on the pgsql-hackers mailing list. But mailing lists are a great place to discuss random stuff, not a great place to track stuff. It wasn’t uncommon for a patch to get forgotten (including bugfixes).
To improve that, in 2008 we adopted the concept of a “commitfest” - a month-long period of development work, followed by a pause before the next cycle begins. Each major version has several such commitfests (currently 5, which takes a year). And to help with organizing these cycles, we developed a “commitfest app” (CFA) - a place where we register and track patches.
That means we now know the “status” for each patch. I’m not going to pretend the status is always up to date / accurate (particularly the needs review and waiting on author states can go stale). But at the end of each commitfest we know if a patch was committed, rejected or moved to the next commitfest. So the uncertainty goes away.
We have this data for all the past commitfests since 2008, and when plotted (for individual commitfests), it looks like this:
The numbers of committed and rejected patches seem somewhat stable (more about that in a bit). But there’s a pretty visible change for patches moved to the next commitfest in 2014.
Why there were no moved patches before 2014 is very simple. The old application used for tracking patches did not have such a state. Instead, patches were rejected at the end of the commitfest, and had to be resubmitted for a future one (if the author wished to continue working on it). So some patches did move, but it was a bit hidden.
Then in 2014 we decided we don’t like having to do that manually, and we added the moved state. It might not have been the intent, but this became the “default” state for patches without a clear state at the end of the commitfest. Don’t know what the patch needs to move it forward? Don’t know if it has any chance to get committed? Don’t know if the author is still interested in working on it? No problem, just move it to the next CF, and it’s solved.
Obviously, it didn’t work particularly well - the number of patches that just move to the next CF is growing over time. Patches that went through 10+ commitfests are not unusual.
I’m not suggesting we should stop moving patches like this. Some of this might be due to not having enough people capable of reviewing such complex patches. Some of the patches may not have a chance of getting committed, but no one is willing to tell the bad news. People may be struggling with the development process itself. Or maybe they are just discussing the patch and everything is fine.
But it’s a clear sign of a bottleneck somewhere, and we need to do something about it. Having commitfests with 350+ patches, ~70% of which just move from commitfest to commitfest, is not free either. For example, finding a patch to review was never easy, and this is making it even harder. That is not great for anyone.
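To put a number on how many patches just get moved, something like the following could be used. It is only a sketch under assumptions: I’m not claiming the commitfest app offers an export in this form, so the tab-separated “commitfest, final status” input, the exact status label and the function name are all hypothetical.

```shell
# Print, per commitfest, the number of patches and the percentage of them
# that were moved to the next commitfest.
# Reads tab-separated "commitfest<TAB>status" lines on stdin
# (an assumed format, one line per patch).
moved_share_per_commitfest() {
    awk -F '\t' '
        {
            total[$1]++
            # The status label is an assumption; adjust it to the actual data.
            if ($2 == "Moved to next CF") moved[$1]++
        }
        END {
            for (cf in total)
                printf "%s %d %.0f%%\n", cf, total[cf], 100 * moved[cf] / total[cf]
        }' \
    | LC_ALL=C sort
}

# Usage (patches.tsv is a hypothetical file in the format described above):
#   moved_share_per_commitfest < patches.tsv
```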
I’m not the only person who realizes we need to improve this somehow. Back in May, Robert started an interesting thread explaining why commitfest.postgresql.org is no longer fit for purpose. And the challenge of finding something to review is the thing he starts with. If even developers as experienced as Robert struggle, that’s not great.
If you’re trying to understand the Postgres development process, I highly recommend reading that thread. It’s a great overview of how current developers think about both the process and addressing some of the issues.
A couple paragraphs back (right after the chart) I said the number of committed patches seems mostly stable, but that I’d get back to this. That chart is a bit difficult to read, because the numbers are dominated by the moved patches, making the other differences seem small. It’s also a bit misleading, because we’ve increased the number of commitfests per year a bit. We started with 4, now we have 5.
If you count the committed patches by year, you will get this chart (the 2024 data is incomplete, of course):
That looks very different - we’ve pretty much doubled the number of patches committed per year. That may not seem impressive, considering we’ve increased the number of committers ~5x in that period. But we’re also doing more complex features, paying more attention to testing etc.
Mailing list activity
Enough about the development activity, let’s talk about users. Since the very beginning, the primary way to communicate with the rest of the community (be it other users or devs) was mailing lists. That was what most open source projects did in 2000, and Postgres did that too.
If you load the mailing list archives and count the number of messages and people who sent at least one message (per year), you’ll get this:
Uh-oh! That does not look great. Both metrics peaked ~2005, after which there’s a very clear decrease - and it’s not one abrupt change but a continuous trend. The number of messages is down ~50%, for the number of people the drop is ~60%.
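The two per-year metrics are simple to compute once the archives are reduced to one line per message. A rough sketch, assuming that reduction has already been done (the “year sender” input format and the function name are my assumptions, and extracting those lines from the raw archives is not shown):

```shell
# Count messages and distinct senders per year.
# Reads "YYYY sender-address" lines on stdin (one line per message)
# and prints "YYYY messages senders" lines.
messages_and_senders_per_year() {
    awk '
        {
            msgs[$1]++
            key = $1 SUBSEP $2
            # Count each (year, sender) pair only once.
            if (!(key in seen)) { seen[key] = 1; people[$1]++ }
        }
        END { for (y in msgs) print y, msgs[y], people[y] }' \
    | LC_ALL=C sort
}

# Usage (messages.txt is a hypothetical file in the format described above):
#   messages_and_senders_per_year < messages.txt
```

Feeding it only the lines for a single list gives the per-list numbers.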
But let’s look at only the pgsql-hackers mailing list, and the chart looks very different:
The number of people drops a bit (compared to the peak in 2001-2003), by ~25%, but is pretty stable for ~15 years. And the number of messages actually grows by ~50% in the same period - if you are subscribed to pgsql-hackers, this probably does not surprise you. So the group of pgsql-hackers subscribers does not grow very much, but the people are more and more active (i.e. send more messages).
A natural question then is - which mailing lists got less active? The eight mailing lists that lost the most activity are these:
These are all user-oriented lists - general “help” lists for English- and Spanish-speaking users and lists to discuss problems with SQL and performance. What could explain this?
One option is that people suddenly learned to write SQL queries that have no performance issues, and that we’ve made it so perfect no one needs help. But that might be a bit too optimistic.
Or perhaps we’ve lost ~80% of the users, and they switched to some other database. But judging by how popular Postgres is getting, that does not seem likely either.
I believe the users simply don’t like mailing lists that much, and they may have switched to other places when they need help. Perhaps platforms like Stack Overflow that are easy to join and don’t force you to receive messages from everyone else. Or maybe chat platforms like Discord or Slack, with a number of Postgres channels maintained by someone else.
I don’t think this is necessarily a problem in itself. Mailing lists are a bit arcane and not what new projects choose to do, and if the other places work better for users, I don’t see that as an issue.
What worries me a little bit is that we might lose an important feedback channel from the users, telling us what does (and does not) work, etc. Although, it’s not like those other places are hidden, a lot of the developers are present there too. So I’m not sure we lose the feedback in practice.
Conclusions
So, those were the charts. I hope it was interesting and perhaps shows both the good things and challenges in growing the community. Take the explanation with a grain of salt - a lot of it is speculation, or at least a very subjective/biased interpretation.