Will Postgres development rely on mailing lists forever?
Postgres is pretty old. The open source project started in 1996, so close to 30 years ago. And since then, Postgres has become one of the most successful and popular databases. But it also means a lot of the development process reflects how things were done back then. The reliance on mailing lists is a good example of this heritage. Let’s talk about if / how this might change.
The arcanity of mailing lists
Submitting the first patch into an established project can be hard. Not because the coding part is inherently difficult, but you may not be familiar with the established process. This is especially true for long-running projects, as the contribution process gets set up early and evolves only very slowly. After a couple decades it may feel a bit arcane to new people joining the project. No doubt this applies to Postgres too.
Mailing lists are a good example of this. Back in the late 90s mailing lists were what most open source projects did. You subscribed to a developer list and sent an email with your patch attached. Other people replied with reviews and other feedback, you responded and occasionally sent a new version, etc. And then someone with commit access picked up the patch and pushed it into the repository. That’s how it worked, and so everyone was familiar with the overall process.
Then platforms like GitHub and GitLab appeared, and with them the idea of an integrated contribution process based on “pull requests”. Many new open source projects found that process more convenient and adopted it, and mailing lists became much less common.
There’s an endless discussion about the pros/cons of these development processes. Is one of them fundamentally better? My experience is that PRs work fine for smaller / straightforward patches. But once the patch gets more complex, PR reviews and flat (non-threaded) discussions get very confusing or just broken. That’s my personal / subjective opinion.
Another question is whether our project should rely on complex products (or services) developed by someone else. I’ll get back to this later.
In any case, Postgres still uses mailing lists. But the environment has changed, and mailing lists are no longer the default process used by every other open source project. Sure, the kernel and similar projects still use that. But new projects likely use something else.
The consequence is new contributors may not be very familiar with our development process. Is that a major issue / bottleneck? If so, what are we going to do about it? Will we abandon mailing lists altogether?
Evolution
Of course, relying on mailing lists does not mean the process did not change in various ways.
For example, we adopted and developed various new tools. Back in 1996 the project started to use the CVS version control system, because that was the only widely available tool. There was no git, Subversion, or other modern VCS yet. For a while CVS worked well enough, but then it became obvious the new tools were more convenient. Thus the project moved to git in 2010.
We also built various auxiliary tools to help with managing stuff. Mailing lists are not particularly good (i.e. terrible) at tracking the status of patches, which is what led to the commitfest app.
We even leverage some of the hosted features. For example, if you have a repository on GitHub, you can easily enable a CI workflow to run all regression tests on multiple platforms. And we use the CI service to do automated testing of WIP patches.
Speaking of CI, buildfarm is another internal tool of ours, developed long before CI services became a thing. And even today it couldn’t be replaced by any of them, because we test so many rare platforms and combinations.
Those were just a couple examples, but I hope it shows the project is not standing still. There are changes, but gradual and incremental, so that we don’t break workflows too much.
Survivorship bias
It seems fair to mention the discussion about the development process is a great example of survivorship bias. The vast majority of those participating in the discussion are existing long-term developers, subscribed to the mailing list and familiar with the existing process. And they are likely reasonably comfortable with it, otherwise they wouldn’t be contributing.
I’m definitely in this group of “survivors”. I’ve been contributing for a long time, and the mailing list process feels very natural to me. My whole workflow is built in a way to make this convenient.
For me this is maybe a bit worse, because most of the projects I work on (and contribute to) follow about the same development process. And the couple times I tried to use the pull-request thing for a bit more complex patch, it was a pretty terrible experience. Reviews got broken after pushing a rebased/updated patch version, and the flat history of the discussion is … impossible to follow.
So I don’t have a very good idea of what a “good” pull-request workflow would look like. But I’m open to the idea that it exists, and I’m really grateful to the developers who propose improvements in that regard.
What’s the point, though?
That leads to the question: What’s the point of these changes? What do we want / expect to happen if we implement them?
I’ve heard suggestions that by accepting PRs on github (or elsewhere) we may get a lot of new contributors. Which in turn would greatly help the project because more developers means higher review bandwidth etc.
I’m a bit skeptical about this hypothesis. I can imagine PRs working for small drive-by patches - small fixes, clarifications of docs. But would that lead to becoming a long-term contributor? Not sure I see the path to that - not as clear as some of the proposals make it seem.
A fundamental reason for my skepticism is that I’m not convinced the number of motivated developers is the primary bottleneck. Go read the recent “commitfest.postgresql.org is no longer fit for purpose” thread on the developer mailing list. I think Robert is spot on that there are plenty of issues. How difficult it is to find a patch to review, for example. We have a tool to track patches people work on, but it’s very time-consuming to find a patch ready for a review.
And if it’s difficult for experienced developers, it’s going to be impossible for new contributors, no matter how motivated they are. I’d argue this may be the primary bottleneck about reviews. We actually have quite a few very motivated contributors on pgsql-hackers, but they run into these walls.
I don’t see how switching to PRs would help with that issue at all. But maybe there’s a project already doing this, and we could learn something from their experience?
Dependencies
By now you’re probably aware I’m not a huge fan of pull-requests. I’m open to the possibility it could be made to work with my workflow, even for complex patches.
However, there’s another reason why the community may prefer a simpler self-hosted process. A pull-request workflow would likely require using either a managed service, or at least a self-hosted version (say GitLab). Both options introduce dependence on another party, which brings risks.
For hosted services the risks are mostly obvious. The entity operating the service has its own set of incentives and interests. These may align with the project for a long time, but company strategy may also shift quickly and unexpectedly. There are past examples of exactly this happening. I don’t blame the companies - it’s well within their rights to reconsider, for any reason. But let’s not pretend there are no risks.
This is especially bad for hosted-only services, like GitHub, where you can’t just take the data and move. Even if you could export all the data, will you be able to import them elsewhere without losing anything? I’m not sure anyone can guarantee that.
Even with perfectly aligned interests, there are legal risks. Companies may be subject to different (or stricter) rules in some cases. For example, companies may be prohibited from providing services to customers from a certain country, due to sanctions. These restrictions may not apply to open source communities, non-profits etc. But the company may be unable to differentiate, so it will just disable access for everyone based on geolocation.
Would self-hosting a product like GitLab address these problems? It certainly would help, because the decisions are now made by you, not someone else. But some risks remain, and you get some new ones too.
The new problem is that you now have to operate the service yourself. That requires resources, particularly the attention and time of the infrastructure team. Those people are usually already spread thin, so giving them yet more responsibilities is not great.
But what if the project gets abandoned? Sure, that’s not very likely for established projects like GitLab - I certainly don’t wish for that to happen, the project is awesome. But the Postgres project had a bad experience in the past with a custom GForge fork (an earlier product of the same kind).
So people may be a bit wary of repeating that experience.
Conclusions
So, what do I expect to happen in the future?
I don’t think we’ll just flat-out switch to pull requests. Certainly not anytime soon. I expect incremental and gradual improvements to the process, done in a way that does not break workflows of current developers. Simply because the current developers are the first group you need to convince.
I also expect more “optional” improvements, adding various capabilities. I mean stuff like github CI workflow, which became a crucial part of my personal development workflow. Or even small improvements like the cfbot link to show visual diffs for patches on github.
We probably need to simplify the various custom tools we’ve developed over the years. We’ve kept adding features, some of which ended up a bit abandoned, but we never removed them. For example, it’s possible to submit a review from the commitfest app (CFA), but I don’t know anyone using it. Perhaps we should clean this up, to make the tools easier to use (and less overwhelming for new developers).
We also need to integrate the tools better, to make it easier to manage patches. Right now the CFA and the mailing list have no shared concept of “patch status”, and the two can easily diverge. Maybe there should be one? It’s also not possible to subscribe to “notifications” about the state of a particular patch.
If someone believes pull requests are the way to go, I think it’d be very helpful to make it work for non-trivial patches, and advocate for it. I have my doubts, though …
I was wondering if “federation” would be a way to handle pull requests from new contributors. That is, if someone could create a custom fork (e.g. on github) to accept pull requests, curate them in some way, and then forward them to the main project? For drive-by patches that might be enough, and a good opportunity to “convert” the motivated contributors to the regular process.