My school (UW) recently allowed students to switch their school email accounts over to one of two providers: Google (via gmail) or Microsoft (via Outlook Web Access). Feeling the inexorable pull of the cloud, I decided to take the plunge and switch to gmail, but one thing held me back. I have used Thunderbird for a few years now and have accumulated roughly 1GB of archived messages that are generally unimportant but occasionally vital. I’ve also adopted Lifehacker’s Trusted Trio email organization system, so many of the emails in my archive are tagged for easier organization.
I’d like to have this email archive searchable from gmail. And I would really like to have all those tags I’ve accumulated over the years show up as gmail “labels”. Fortunately, this isn’t too difficult thanks to the following:
- Thunderbird stores message tags in a special X-Mozilla-Keys header in the email message itself. So it’s easy to recover the tags by parsing the email.
- Google provides an email migration API for uploading messages into some hosted Google Apps gmail accounts. Unfortunately, “This API is only available to Google Apps Premier, Education, and Partner Edition domains, and cannot be used for migration into Google Apps Standard Edition email or Gmail accounts” sayeth the Google. You also have to have your Google Apps administrator enable email migration by end users – by default, only administrators can migrate email.
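To make the first point concrete, here is a minimal sketch of recovering tags from a single message using Python's standard `email` module. It is not the code from tbird2gmail.py: the function name `tags_from_message` is made up for illustration, and the keyword table assumes Thunderbird's default tag set (`$label1` through `$label5`). Custom tags appear under their own keyword, so the sketch passes those through unchanged.

```python
import email

# Assumption: Thunderbird's default tags are stored as these keywords.
# Custom tags are stored under their own keyword and are passed through as-is.
DEFAULT_TAG_NAMES = {
    "$label1": "Important",
    "$label2": "Work",
    "$label3": "Personal",
    "$label4": "To Do",
    "$label5": "Later",
}


def tags_from_message(raw_message: str) -> list[str]:
    """Return the tag names found in a message's X-Mozilla-Keys header."""
    msg = email.message_from_string(raw_message)
    # The header holds space-separated keywords; it may be absent or blank.
    keywords = msg.get("X-Mozilla-Keys", "").split()
    return [DEFAULT_TAG_NAMES.get(keyword, keyword) for keyword in keywords]


if __name__ == "__main__":
    sample = (
        "From: someone@example.com\n"
        "X-Mozilla-Keys: $label1 followup\n"
        "Subject: hello\n"
        "\n"
        "body\n"
    )
    print(tags_from_message(sample))
```

Each name returned this way could then be sent along as a gmail label when the message is uploaded.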
There are a few other nice things about using the migration API (as opposed to just copying the emails from one IMAP folder to another). One is that time/date information is preserved. With an IMAP copy, emails all appear to have arrived on the day that the copy was made. Also, metadata can be preserved (e.g. messages flagged “Important” in your mail client can be translated into “starred” messages in gmail). Overall, your/my obsessive-compulsive side will be much happier if you can use the migration API.
So, I wrote a little Python script that uses the Google email migration API to upload a bunch of emails from Thunderbird (or any ol’ mbox file, but it only understands how to read “tags” set by Thunderbird) into a hosted gmail account. To use it, you’ll need:
- the Google Python API (I used version 2.0.4)
- the actual python script (for the curious, a darcs repository is also available)
You need to ensure that Python can find the code in the Google Python API. The simplest way is to set your PYTHONPATH environment variable to include the root directory of the API files. For example, if you extract the Google API files into the same directory as the tbird2gmail.py script, and you’re using bash, you can run everything with:
PYTHONPATH=$PYTHONPATH:gdata-2.0.4 python tbird2gmail.py arguments...
The tbird2gmail script takes a series of mbox files as command line arguments and uploads their contents to the specified hosted gmail account. You need to provide your login information (email address and password). One particularly useful feature is the --label command line flag, which applies a label to all uploaded messages. If the upload fails for some reason (see below), it is then easy to find all the emails that were part of the upload, so you can e.g. delete them and try again.
I used this script to upload roughly 30,000 emails (about 1GB in size) into my UW gmail account. Emails are uploaded in batches of configurable size; I found that with a batch size any larger than about 2MB, the gmail server would get overwhelmed and close the connection. So I shrank the batch size, added a delay between batch sends, and the thing ran happily all night. The theoretical maximum batch size allowed by the Google API is 32MB.
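The batching idea can be sketched roughly as follows. This is an illustration rather than the script's actual code: `batches_by_size`, `upload_all`, and the `send_batch` callback are hypothetical names, and the 2MB limit and one-second pause are just example defaults.

```python
import time
from typing import Callable, Iterable, Iterator


def batches_by_size(messages: Iterable[bytes], max_bytes: int) -> Iterator[list[bytes]]:
    """Group messages into batches whose total size stays within max_bytes.

    A single message larger than max_bytes still gets a batch of its own.
    """
    batch: list[bytes] = []
    size = 0
    for message in messages:
        if batch and size + len(message) > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(message)
        size += len(message)
    if batch:
        yield batch


def upload_all(
    messages: Iterable[bytes],
    send_batch: Callable[[list[bytes]], None],  # hypothetical upload callback
    max_bytes: int = 2 * 1024 * 1024,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send every batch, pausing between sends; return the number of batches sent."""
    sent = 0
    for batch in batches_by_size(messages, max_bytes):
        if sent:
            sleep(delay_seconds)  # give the server a breather between batches
        send_batch(batch)
        sent += 1
    return sent
```

Passing `sleep` in as a parameter is only there so the pause can be swapped out when trying the code without waiting.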
This is definitely a hack; the Google API docs say that the server should return a 503 error when it becomes overworked, but in practice I found this was not triggered reliably before the connection was simply terminated.
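A slightly less hacky alternative, which is not what my script does, would be to treat a dropped connection the same way as a 503 and retry with an increasing delay. A rough sketch, where `ServiceUnavailable` is a hypothetical stand-in for however your HTTP client reports a 503 and `send` is whatever function uploads one batch:

```python
import time
from typing import Any, Callable


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for an HTTP 503 response from the server."""


def send_with_backoff(
    send: Callable[[Any], Any],
    batch: Any,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send(batch), retrying with exponential backoff on overload signals.

    Both an explicit 503 and a terminated connection count as "server is busy".
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return send(batch)
        except (ServiceUnavailable, ConnectionError):
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x, ... the base delay
```

Whether this behaves better than a fixed delay against the real server is something I have not measured.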
This script should not be used if the underlying mbox files can be modified by another application. Doing so could result in a corrupted mailbox file, costing you all your mail! So close down your mail client before uploading any email.
Gmail will also reject mail that does not conform to the RFC 822 specification. I had a few emails that were missing proper Date: headers, and others that had invalid attachments (generally bounce messages). tbird2gmail will log these failed emails into a new mbox file, where you can edit them and then try uploading again. The failure reason returned from the Google server is stored in a special X-Tbird2gmail-Upload-Failure-Reason header in the message itself. You can read more about what these failure codes look like at http://code.google.com/apis/gdata/docs/2.0/reference.html#HTTPStatusCodes.
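The failure log can be approximated with Python's standard `mailbox` module. The sketch below is my own illustration of the idea, not the script's code: the function name `log_failed_message` is hypothetical, and the reason string is whatever text the server returned.

```python
import email
import mailbox

FAILURE_HEADER = "X-Tbird2gmail-Upload-Failure-Reason"


def log_failed_message(failed_mbox_path: str, raw_message: bytes, reason: str) -> None:
    """Append a rejected message to an mbox file, recording why it was rejected."""
    msg = mailbox.mboxMessage(email.message_from_bytes(raw_message))
    # Collapse whitespace so the reason fits on a single header line.
    msg[FAILURE_HEADER] = " ".join(reason.split())
    box = mailbox.mbox(failed_mbox_path)  # created if it does not exist yet
    try:
        box.lock()
        box.add(msg)
        box.flush()
    finally:
        box.close()  # close() also releases the lock
```

After fixing a message in the resulting file (adding a Date: header, say), the file can be fed back to the uploader like any other mbox.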
Update (3 May 2010): Lisa, another grad student here at UW, has written some python scripts for migrating mbox files into Google Apps that don’t depend on the now-deprecated interface in the Google Python libraries; just pure Python 2.6. You can find the scripts here.