Tomas Vondra

blog about Postgres code and community

Importing Postgres mailing list archives

A couple weeks ago I needed to move my mailing list communication to a different mailbox. That sounds straightforward - go to the community account and resubscribe to all the lists with the new address, and then import a bit of history from the archives so that the client can show threads, search etc.

The first part worked like a charm, but importing the archives turned out to be a bit tricky, and I ran into a bunch of non-obvious issues. So here’s how I made that work in the end.

Why import archives?

The mailing list archives are available on the community website. You can browse them, search them, and even re-send individual messages (and then reply). So if that works for you, maybe you don’t need to import anything.

I prefer to import a limited amount of history into my IMAP mailbox. Some of the development discussions span quite a bit of time, and it’s more convenient to see the discussion in one place. The client (I use Thunderbird) can also show threads in a structured way, while the website shows only a flat view. For discussions that branch a lot, the flat view is much harder to follow.

Doing the import

Let’s say you decided to import the archives. You need to go to the website, and download the mbox files for a selected list. Those are available for each list, e.g. for pgsql-hackers.

How many files should you download? That’s up to you. You could download the complete history, but the archives are pretty large. I chose to import about 3 years of history, which is enough 99% of the time.

Once you download the necessary files, you need to import them into the mailbox. And that’s where we start running into issues, because the mbox format is specified only very loosely. AFAIK it evolved very organically, with different MTAs using the same overall idea but handling some details differently. So there are ambiguities.

I’m going to talk about the things that didn’t work first. If you’re only looking for the solution, skip to the “Mutt to the rescue!” section.

Import/Export Tools NG in Thunderbird

I’m using Thunderbird, so the obvious option was to do the import using the client itself. AFAIK Thunderbird can’t import mbox files on its own, but it has an add-on called Import/Export Tools NG to do this. After installing the add-on, you can import individual mbox files, or even import all files from a directory. Which is handy, as you may want to import many files.

[Screenshot: Thunderbird mbox import]

Unfortunately, this has various issues. For some of the mbox files the import stops half-way through, without any alert or indication why. I guess this happens when the add-on runs into a message it can’t parse, or something like that. I haven’t found any log explaining the failure, but I haven’t looked very long, because there’s this second issue:

[Screenshot: attachments imported as messages]

Yes, those are attachments, parsed and loaded as individual messages. It’s common to use git format-patch to generate patches, and attach those to a message. Each patch file starts with a header like this:

From 98a361f95c2c4969488c2286f8aa560b45f8c0a8 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 8 Jan 2024 00:32:22 +0100
Subject: [PATCH v240118 2/4] Increase NUM_LOCK_PARTITIONS to 64
...

But that’s very similar to the header of a message in the mbox file:

From pgsql-hackers-owner+archive@lists.postgresql.org Tue Jan 01 00:39:09 2019
Received: from malur.postgresql.org ([217.196.149.56])
        by arkaria.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256)
        (Exim 4.89)
        (envelope-from <pgsql-hackers-owner+archive@lists.postgresql.org>)

The patch has a commit hash in the From field, but otherwise it’s quite similar. I haven’t looked at the code, but I’d bet the add-on simply splits the file on lines starting with From, gets confused by the attachment headers, and treats them as separate messages.

Not great, I’d really like attachments to remain attachments.
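To make the ambiguity concrete, here is a small sketch using the two header lines shown above. The loose pattern is only my guess at what a naive splitter might accept (I have not read the add-on's code), and the helper names are made up for illustration. The stricter check relies on knowing the envelope sender the list archives use.

```python
import re

# Loose pattern: "From <token> <weekday> <month> <day> <time> <year>".
# Assumption: this approximates what a naive mbox splitter accepts.
NAIVE_SEPARATOR = re.compile(
    r"^From \S+ \w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}\s*$"
)

# Stricter check: only accept the envelope sender used by the archives.
ARCHIVE_SENDER = "From pgsql-hackers-owner+archive@"


def naive_is_separator(line):
    """True if the line looks like any mbox 'From ' separator."""
    return NAIVE_SEPARATOR.match(line) is not None


def strict_is_separator(line):
    """True only for separators written by the list archives."""
    return line.startswith(ARCHIVE_SENDER)


patch_line = "From 98a361f95c2c4969488c2286f8aa560b45f8c0a8 Mon Sep 17 00:00:00 2001"
mbox_line = ("From pgsql-hackers-owner+archive@lists.postgresql.org "
             "Tue Jan 01 00:39:09 2019")

for line in (patch_line, mbox_line):
    print(naive_is_separator(line), strict_is_separator(line))
```

The loose check accepts both lines, so the patch attachment would start a new "message". The strict check accepts only the real separator.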

Import/Export Tools NG in Thunderbird with .eml

It occurred to me that the add-on can also import EML files, which is essentially each message in a separate file with the .eml extension. In that case the add-on no longer has to split the mbox file into messages, which works around the issue.

I wrote a simple Python script, which takes a list of mbox files, and splits them into .eml files.

#!/usr/bin/env python3

import sys

# every message in the archive starts with this envelope line
SEPARATOR = b'From pgsql-hackers-owner+archive'

cnt = 0


def write_message(message):
        '''write one message into the next #.eml file'''
        global cnt
        cnt += 1
        with open(str(cnt) + '.eml', 'wb') as o:
                o.write(message)


# read list of filenames passed to the script
for d in sys.argv[1:]:

        # open the file in binary for reading
        with open(d, 'rb') as f:
                message = b''

                # accumulate lines into a message, until we hit start of the
                # next message (we know what From value to expect)
                for l in f:
                        if l.startswith(SEPARATOR) and message != b'':
                                write_message(message)
                                message = b''

                        # add the line to the 'current' message
                        message += l

                # write the last message of the file, which has no
                # separator line following it
                if message != b'':
                        write_message(message)

Then we can import all .eml files from a directory. And that almost works - the patches are no longer treated as separate messages. But after checking this more thoroughly, I noticed some of the threads are broken - treated as two separate threads. Not a huge issue, but annoying. And I also found a couple “bogus” messages that seemed more like a small part of an original message.
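One way to hunt for such fragments before importing is to check that every .eml file parses into something with the headers a real message should have. This is a rough sketch using Python's standard email parser. The function name and the choice of headers to check (Message-ID and From) are my assumptions, and a heuristic like this will not catch every broken message.

```python
import glob
import os
from email.parser import BytesParser


def find_suspicious_eml(directory):
    """Return paths of .eml files lacking a Message-ID or From header."""
    suspicious = []
    parser = BytesParser()
    for path in sorted(glob.glob(os.path.join(directory, "*.eml"))):
        with open(path, "rb") as f:
            data = f.read()
        # Drop the mbox "From " envelope line, if the file still has it.
        if data.startswith(b"From "):
            data = data.split(b"\n", 1)[-1]
        # Parse only the header section; the body is not needed here.
        headers = parser.parsebytes(data, headersonly=True)
        if headers["Message-ID"] is None or headers["From"] is None:
            suspicious.append(path)
    return suspicious
```

Running find_suspicious_eml('.') in the directory with the split files returns the files worth inspecting by hand.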

At this point I abandoned the idea of importing the archives through Thunderbird. There clearly are parsing issues, and I had no intention to learn how to fix add-ons or Thunderbird itself.

Mutt to the rescue?

When I asked around, someone suggested that mutt can parse our mbox files correctly. It took many tries, but we finally managed to tweak the output format just the right way.

I gave it a try, and opened the archive mbox file using

mutt -R -f pgsql-hackers.mbox

Unfortunately, I got this:

[Screenshot: mutt listing git format-patch attachments as separate messages]

That doesn’t look promising - those are git format-patch attachments treated as separate messages, just like with the Thunderbird approach.

Mutt to the rescue!

But hey! With Thunderbird we also tried splitting the mbox into .eml files ourselves, and then passed the result to the client. Let’s try that with mutt too.

To do that, we can use maildir - a mailbox format, where each message is a separate file. Maildir is a directory with three subdirectories - cur, new and tmp. So let’s create that:

mkdir -p maildir/{cur,new,tmp}

Now copy all the .eml files into maildir/cur, and run mutt on it.
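Plain copying is what I did, but the same step can also be scripted with the mailbox module from the Python standard library, which creates the directory structure and picks unique file names. Treat this as a sketch, not something I ran against the full archives. The function name is made up, and it assumes the .eml files still start with the mbox "From " envelope line, which it strips.

```python
import glob
import mailbox
import os


def eml_to_maildir(eml_dir, maildir_path):
    """Add every .eml file in eml_dir to a Maildir; return the count."""
    # create=True creates maildir_path with cur/, new/ and tmp/
    # if the directory does not exist yet.
    md = mailbox.Maildir(maildir_path, create=True)
    count = 0
    for path in sorted(glob.glob(os.path.join(eml_dir, "*.eml"))):
        with open(path, "rb") as f:
            data = f.read()
        # The envelope line belongs to mbox, not to the message itself.
        if data.startswith(b"From "):
            data = data.split(b"\n", 1)[-1]
        # Messages added as raw bytes are stored in the new/ subdirectory.
        md.add(data)
        count += 1
    md.close()
    return count
```

Note that messages added this way land in new/ instead of cur/, which only affects whether they show up as new.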

mutt -R -f maildir

[Screenshot: mutt working]

Eureka!

IMAP import

The final step is to actually import the messages into the IMAP mailbox. I created .muttrc to tell mutt how to connect to the IMAP server:

set imap_user=MY_EMAIL
set imap_pass=MY_PASSWORD
set folder=imaps://SERVER:993/pgsql/hackers
mailboxes imaps://SERVER/pgsql/hackers

Here pgsql/hackers is a folder (and subfolder) I created on the IMAP server, where I want to load the archives.

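The easiest way I found to then copy messages in mutt is to tag all messages (T and then ~A), and then save the tagged messages to an IMAP folder (;s and then ? to select a folder). The ; prefix applies the save to all tagged messages; it is not needed if auto_tag is set.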
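If you would rather script the upload than do it interactively, something along these lines might work with imaplib from the Python standard library. This is not what I used. It is an untested-against-a-real-server sketch: the function names are made up, the folder has to exist already, and the hierarchy separator in the folder name (the / in pgsql/hackers) depends on the server.

```python
import imaplib
import mailbox


def connect(server, user, password):
    """Open an IMAP-over-TLS connection (port 993) and log in."""
    conn = imaplib.IMAP4_SSL(server, 993)
    conn.login(user, password)
    return conn


def upload_maildir(conn, maildir_path, folder):
    """Append every message from a Maildir to an IMAP folder."""
    md = mailbox.Maildir(maildir_path, create=False)
    uploaded = 0
    for key in md.iterkeys():
        raw = md.get_bytes(key)
        # No flags and no explicit date: the server uses its defaults,
        # so the internal date is the upload time, not the Date header.
        typ, _ = conn.append(folder, None, None, raw)
        if typ != "OK":
            raise RuntimeError("APPEND failed for %s: %s" % (key, typ))
        uploaded += 1
    return uploaded
```

Usage would be roughly upload_maildir(connect('SERVER', 'MY_EMAIL', 'MY_PASSWORD'), 'maildir', 'pgsql/hackers'), with the same placeholders as in the .muttrc above.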

Not only did this work, it was also pretty fast - much faster than doing this in Thunderbird.

Conclusions

So, that’s it. I really hope that if I need to do this again in a couple years, I will remember I already solved all these problems and wrote the instructions down in a blog post.

Do you have feedback on this post? Please reach out by e-mail to tomas@vondra.me.