Importing Postgres mailing list archives
A couple weeks ago I needed to move my mailing list communication to a different mailbox. That sounds straightforward - go to the community account and resubscribe to all the lists with the new address, and then import a bit of history from the archives so that the client can show threads, search etc.
The first part worked like a charm, but importing the archives turned out to be a bit tricky, and I ran into a bunch of non-obvious issues. So here’s how I made that work in the end.
Why import archives?
The mailing list archives are available on the community website. You can browse them, search them, and even re-send individual messages (and then reply). So if that works for you, maybe you don’t need to import anything.
I prefer to import a limited amount of history into my IMAP mailbox. Some of the development discussions span quite a bit of time, and it’s more convenient to see the whole discussion in one place. The client (I use Thunderbird) can also show threads in a structured way, while the website shows only a flat view, which is much harder to follow for discussions that branch a lot.
Doing the import
Let’s say you decided to import the archives. You need to go to the website and download the mbox files for the selected list. Those are available for each list, e.g. for pgsql-hackers.
How many files should you download? That’s up to you. You could download the complete history, but the archives are pretty large. I chose to import about 3 years of history, which is enough 99% of the time.
Once you have downloaded the necessary files, you need to import them into the mailbox. And that’s where we start running into issues, because the mbox format is specified only very loosely. AFAIK it evolved very organically, with different MTAs using the same overall idea but handling some things differently. So there are ambiguities.
I’m going to talk about the things that didn’t work first. If you’re only looking for the solution, skip to the mutt and maildir section.
Import/Export Tools NG in Thunderbird
I’m using Thunderbird, so the obvious option was to do the import using the client itself. AFAIK Thunderbird can’t import mbox files on its own, but there is an add-on called Import/Export Tools NG that does this. After installing the add-on, you can import individual mbox files, or even import all files from a directory. Which is handy, as you may want to import many files.
Unfortunately, this has various issues. For some of the mbox files the import stops half-way through, without any alert or indication why. I guess this happens when the add-on runs into a message it can’t parse, or something like that. I haven’t found any log explaining the failure, but I didn’t look for very long, because there’s a second issue: attachments get parsed and loaded as individual messages.
It’s common to use git format-patch to generate patches, and attach those to a message. Such patches start with a header like this:
From 98a361f95c2c4969488c2286f8aa560b45f8c0a8 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 8 Aug 2024 00:32:22 +0100
Subject: [PATCH v240118 2/4] Increase NUM_LOCK_PARTITIONS to 64
...
But that’s very similar to the message header in the mbox file:
From pgsql-hackers-owner+archive@lists.postgresql.org Tue Jan 01 00:39:09 2019
Received: from malur.postgresql.org ([217.196.149.56])
by arkaria.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256)
(Exim 4.89)
(envelope-from <pgsql-hackers-owner+archive@lists.postgresql.org>)
The patch has a commit hash in the From line where the message has a sender address, but otherwise the two are quite similar. I haven’t looked at the code, but I’d bet the add-on simply splits the file on lines starting with From, gets confused by the attachment headers, and treats them as separate messages.
Not great, I’d really like attachments to remain attachments.
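To illustrate the ambiguity, here is a minimal Python sketch comparing two ways of counting message starts in raw mbox data: a naive one that counts every line starting with "From ", and a stricter one that only accepts the envelope sender used by the archive. This is only my guess at what goes wrong, not the add-on's actual code. The count_separators helper and the sample data are made up for illustration.

```python
# Assumption: every real message in the archive starts with this envelope
# sender, as in the pgsql-hackers example above.
ENVELOPE = b"From pgsql-hackers-owner+archive@"


def count_separators(data: bytes):
    """Return (naive, strict) counts of message starts in raw mbox data."""
    naive = 0
    strict = 0
    for line in data.splitlines():
        # naive: any line starting with "From " begins a new message
        if line.startswith(b"From "):
            naive += 1
        # strict: only the known envelope sender begins a new message
        if line.startswith(ENVELOPE):
            strict += 1
    return naive, strict


# One message with a git format-patch attachment in its body.
sample = (
    b"From pgsql-hackers-owner+archive@lists.postgresql.org Tue Jan 01 00:39:09 2019\n"
    b"Subject: a message with a patch\n"
    b"\n"
    b"From 98a361f95c2c4969488c2286f8aa560b45f8c0a8 Mon Sep 17 00:00:00 2001\n"
    b"From: Tomas Vondra <tomas@vondra.me>\n"
    b"Subject: [PATCH v240118 2/4] Increase NUM_LOCK_PARTITIONS to 64\n"
)

print(count_separators(sample))
```

The naive count sees two messages in the sample, the strict one sees a single message. Note that "From:" headers are not affected, only lines where "From" is followed by a space.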
Import/Export Tools NG in Thunderbird with .eml
It occurred to me that the add-on can also import EML files, which essentially means each message is in a separate file with the .eml extension. In that case the add-on is not responsible for splitting the archive into messages, which works around the issue.
I wrote a simple Python script, which takes a list of mbox files, and splits them into .eml files.
#!/usr/bin/env python3
import sys

cnt = 0

# read list of filenames passed to the script
for d in sys.argv[1:]:
    # open the file in binary for reading
    with open(d, 'rb') as f:
        message = b''
        # accumulate lines into a message, until we hit start of the
        # next message (we know what From value to expect)
        for l in f:
            if l.startswith(b'From pgsql-hackers-owner+archive'):
                if message != b'':
                    cnt += 1
                    # write the message into #.eml file
                    with open(str(cnt) + '.eml', 'wb') as o:
                        o.write(message)
                message = b''
            # add the line to the 'current' message
            message += l
        # write the last message of the file too, as there is no
        # following From line to trigger the write above
        if message != b'':
            cnt += 1
            with open(str(cnt) + '.eml', 'wb') as o:
                o.write(message)
Then we can import all .eml files from a directory. And that almost works - the patches are no longer treated as separate messages. But after checking this more thoroughly, I noticed some of the threads are broken - treated as two separate threads. Not a huge issue, but annoying. And I also found a couple “bogus” messages that seemed more like a small part of an original message.
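One way to hunt for such “bogus” messages before importing is to check that each split message has the headers a real message should have. Here is a minimal sketch using Python's standard email parser. The choice of required headers and the missing_headers helper are my assumptions, and the sample messages are made up.

```python
from email import policy
from email.parser import BytesParser

# Assumption: a real list message always carries these headers.
REQUIRED = ("Message-ID", "From", "Date")


def missing_headers(raw: bytes):
    """Return the required headers that are absent from one raw message."""
    # headersonly=True skips parsing the body; a leading mbox "From " line
    # is recognized by the parser and not treated as a header
    msg = BytesParser(policy=policy.compat32).parsebytes(raw, headersonly=True)
    return [h for h in REQUIRED if msg[h] is None]


good = (
    b"From pgsql-hackers-owner+archive@lists.postgresql.org Tue Jan 01 00:39:09 2019\n"
    b"Message-ID: <hypothetical@example.org>\n"
    b"From: someone@example.org\n"
    b"Date: Tue, 1 Jan 2019 00:39:09 +0000\n"
    b"\n"
    b"body\n"
)
fragment = b"just the tail of a message body\n"

print(missing_headers(good))
print(missing_headers(fragment))
```

To apply this to the split files, read each .eml file in binary mode and pass the contents to missing_headers. Any file with a non-empty result deserves a closer look.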
At this point I abandoned the idea of importing the archives through Thunderbird. There clearly are parsing issues, and I had no intention to learn how to fix add-ons or Thunderbird itself.
Mutt to the rescue?
When I asked around, someone suggested that mutt can parse our mbox files correctly. It took many tries, but we finally managed to tweak the output format just the right way.
I gave it a try, and opened the archive mbox file using:
mutt -R -f pgsql-hackers.mbox
Unfortunately, that doesn’t look promising either - the git format-patch attachments are again treated as separate messages, just like with the Thunderbird approach.
Mutt to the rescue!
But hey! With Thunderbird we also tried splitting the mbox into .eml files ourselves, and then passed the result to the client. Let’s try that with mutt too.
To do that, we can use maildir - a mailbox format where each message is a separate file. A maildir is a directory with three subdirectories - cur, new and tmp. So let’s create that:
mkdir -p maildir/{cur,new,tmp}
Now copy all the .eml files into maildir/cur, and run mutt on it:
mutt -R -f maildir
Heureka!
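If you prefer to script the maildir setup, the two manual steps above (create the directories, copy the .eml files into cur) can be sketched in Python like this. The build_maildir helper and its arguments are hypothetical names, and the files keep their plain #.eml names, same as when copying them by hand.

```python
import shutil
from pathlib import Path


def build_maildir(eml_dir, maildir):
    """Create maildir/{cur,new,tmp} and copy every .eml file into cur."""
    maildir = Path(maildir)
    # a maildir needs all three subdirectories, even if two stay empty
    for sub in ("cur", "new", "tmp"):
        (maildir / sub).mkdir(parents=True, exist_ok=True)
    copied = 0
    for eml in sorted(Path(eml_dir).glob("*.eml")):
        shutil.copy(eml, maildir / "cur" / eml.name)
        copied += 1
    return copied
```

Calling build_maildir('.', 'maildir') in the directory with the split files should give the same layout as the mkdir and copy steps, ready for mutt -R -f maildir.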
IMAP import
The final step is to actually import the messages into the IMAP mailbox. I created .muttrc to tell mutt how to connect to the IMAP server:
set imap_user=MY_EMAIL
set imap_pass=MY_PASSWORD
set folder=imaps://SERVER:993/pgsql/hackers
mailboxes imaps://SERVER/pgsql/hackers
The pgsql/hackers is a folder (and subfolder) I created on the IMAP server, where I want to load the archives.
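The easiest way I found to then copy messages in mutt is to tag all messages (T and then ~A), and then save the tagged messages to an IMAP folder (s and ? to select a folder). Unless auto_tag is set, mutt applies a command to the tagged messages only when it is prefixed with ; (so ;s instead of s).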
Not only did this work, it was also pretty fast - much faster than doing this in Thunderbird.
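For completeness, the upload could probably also be scripted without mutt, using Python's imaplib. I did not do it this way, so treat the following as an untested-against-a-real-server sketch. The strip_envelope and upload helpers are hypothetical names, and whether pgsql/hackers is the right folder name depends on the hierarchy delimiter your server uses.

```python
import imaplib
from pathlib import Path


def strip_envelope(raw: bytes) -> bytes:
    """Drop a leading mbox 'From ' separator line, which is not a mail header."""
    if raw.startswith(b"From "):
        _, _, rest = raw.partition(b"\n")
        return rest
    return raw


def upload(conn, folder, eml_dir):
    """APPEND every .eml file in eml_dir to an existing IMAP folder."""
    uploaded = 0
    for eml in sorted(Path(eml_dir).glob("*.eml")):
        raw = strip_envelope(eml.read_bytes())
        # no flags and no explicit date, so the server picks the internal date
        status, _ = conn.append(folder, None, None, raw)
        if status == "OK":
            uploaded += 1
    return uploaded


# Usage, with the same placeholders as in .muttrc (not executed here):
#
#   conn = imaplib.IMAP4_SSL("SERVER", 993)
#   conn.login("MY_EMAIL", "MY_PASSWORD")
#   print(upload(conn, "pgsql/hackers", "."))
#   conn.logout()
```

The connection is passed in as an argument, which makes it easy to try the function against a test account first.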
Conclusions
So, that’s it. I really hope that if I need to do this again in a couple years, I will remember I already solved all these problems and wrote the instructions down in a blog post.