Despite all I’ve done to move my email to my own domain and hosting, inevitably some messages still arrive in the Gmail account I’ve had for more than a decade. I’ve already configured the account to send replies from my new addresses, but I also wanted to archive the 215,000+ messages already stored with Google, along with anything new that arrived there.
Options considered before
One solution is Google’s Takeout service, which will produce an archive of everything stored in Gmail (and many of Google’s other services, too!), but this process is manual and can be very slow. Takeouts can only be created through a web interface; the archive must be downloaded in the browser (for authentication reasons); and since Takeout doesn’t create incremental backups, every message is included in every archive. An archive of just my Gmail account takes about 29 hours for Google to prepare, amounts to nearly 7 GB (in gzipped tar format), and takes several hours to download to my laptop. I then need several more hours to upload the archive to my backup server. While I’m willing to undertake this process once a month to back up all services that Takeout supports (which entails two files totaling around 30 GB), Takeout is impractical for regular exports of a frequently changing service like Gmail.
Since Gmail supports the IMAP protocol, I also could’ve used any number of open-source scripts to synchronize messages to my mailserver. While practical, this was also a non-starter. As I noted in “The volume dilemma,” part of hosting my own email hinged on keeping messages organized. Injecting thousands of emails into an existing mailbox would immediately undo my efforts, and even though I could create a new inbox just for these synchronized messages, both of these approaches introduce a new concern: disk space. At the moment, my Gmail account uses 13.5 GB on Google’s servers; I’d rather not waste that space on my mailserver when the intent is to create a backup. I have a backup server for that. 🙂
Fortunately, despite the relative viability of the preceding options, a third, far superior solution emerged.
gmvault is an open-source (GPL v3.0) command-line tool to synchronize the contents of a Gmail (or Google Apps) account. It allows for “quick” replications (to cover recent messages) and full archives, can capture chat history, and is able to restore messages to either the original or an independent account. Perhaps most importantly, while it manages the archive in its own way, it provides an export option that supports the maildir format, among others; owing to this, I could navigate a Gmail archive using familiar tools such as Dovecot, which already powers my mailserver.
Caveat: I’m only concerned with running gmvault on a server; the remainder of the post reflects this assumption.
gmvault is written in Python, making it easily installable via pip:
pip install gmvault
Next, create a user specifically for running gmvault, whose home directory will store the archives:
useradd -m -d /home/gmvault -s /usr/sbin/nologin -U gmvault
Now that the binary and user exist, start the first backup, which will prompt for the initial configuration. I recommend running this in a screen session, as the process will likely take some time.
sudo -u gmvault -H gmvault sync -d /home/gmvault/gmvault-db-example email@example.com
To allow for backing up multiple Gmail accounts, I specify the -d flag to change the default directory away from /home/gmvault/gmvault-db; this isn’t necessary if you have only one account to archive.
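For several accounts, the same command can be wrapped in a small loop. The following is only a sketch: the account list, the directory naming scheme, and the DRY_RUN switch are all hypothetical choices for illustration, not anything gmvault itself prescribes.

```shell
#!/bin/sh
# Hypothetical account list; replace with the addresses to archive.
ACCOUNTS="email@example.com other@example.com"
# Home directory of the dedicated gmvault user.
DB_ROOT="/home/gmvault"

# Derive a per-account database directory from the address
# (an assumed naming scheme: "@" becomes "_").
db_dir_for() {
    printf '%s/gmvault-db-%s\n' "$DB_ROOT" "$(printf '%s' "$1" | tr '@' '_')"
}

# Sync one account. With DRY_RUN=1 the command is printed, not executed.
sync_account() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sudo -u gmvault -H gmvault sync -d $(db_dir_for "$1") $1"
    else
        sudo -u gmvault -H gmvault sync -d "$(db_dir_for "$1")" "$1"
    fi
}

# Sync every account in ACCOUNTS, one after another.
sync_all() {
    for account in $ACCOUNTS; do
        sync_account "$account"
    done
}
```

Calling sync_all with DRY_RUN=1 set first is a cheap way to check the generated commands before letting the real synchronization run.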
Once the initial backup completes, scheduling two different types of synchronization, on two different schedules, provides an appropriate balance between real-time backups and efficient use of resources.
The command shown in the last section performs a complete sync each time it’s called, which, on a large account, can take an hour or more to complete (after the initial run). To allow for faster, incremental backups, gmvault provides a “quick” sync option.
To enable the quick sync, specify the -t flag when invoking gmvault:
sudo -u gmvault -H gmvault sync -t quick -d /home/gmvault/gmvault-db-example email@example.com
Now, add the following lines to your system’s crontab:
10 * * * * sudo -u gmvault -H gmvault sync -t quick -d /home/gmvault/gmvault-db-example email@example.com
20 3 * * 0 sudo -u gmvault -H gmvault sync -d /home/gmvault/gmvault-db-example email@example.com
The first entry triggers a quick sync at ten minutes past every hour, whereas the second runs a full backup every Sunday at 03:20. I’ve chosen these times rather arbitrarily, after running the two methods for a few days to measure how long each took to complete. In my case, the quick sync completes in under five minutes, and the full sync in less than 40 minutes, so the two processes won’t overlap on Sunday mornings. If necessary, I’d switch the incremental backup to every other hour, but it’s been fine thus far.
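If the runs ever did grow long enough to overlap, one possible safeguard is a lock around each invocation. This is a minimal sketch and not part of the setup described here: the run_exclusive function and the LOCK_DIR path are hypothetical, and it relies only on mkdir being atomic.

```shell
#!/bin/sh
# Hypothetical lock location; mkdir either creates it or fails atomically,
# so the directory doubles as a lock.
LOCK_DIR="${LOCK_DIR:-/tmp/gmvault-sync.lock}"

# Run the given command only if no other run currently holds the lock.
run_exclusive() {
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "another sync appears to be running; skipping" >&2
        return 75   # EX_TEMPFAIL from sysexits.h
    fi
    # Run the command, remember its exit status, then release the lock.
    "$@" && status=0 || status=$?
    rmdir "$LOCK_DIR"
    return "$status"
}
```

Each cron entry would then call a script that wraps its gmvault command in run_exclusive. One caveat with this approach: if the process is killed before rmdir runs, the stale directory has to be removed by hand.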
So that I’m aware of any failures, I pipe gmvault’s output to an email for later review. Doing so is just a matter of appending the following to each of the existing cron entries:
2>&1 | mail -s "Gmvault report" email@example.com
The first part, 2>&1, ensures that any errors are redirected to what becomes the email body, alongside gmvault’s output; the balance of this addition should be self-explanatory (as an aside, I silence cron’s default email output, hence this approach).
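A plain pipe discards the exit status of the command on its left, so every report arrives with the same subject. A possible variation, sketched below, captures the output first and puts the result in the subject line. The run_and_report function and the MAILER and REPORT_TO variables are hypothetical names, and the sketch assumes gmvault signals failure through a non-zero exit status, which is worth verifying before relying on it.

```shell
#!/bin/sh
# Assumptions for illustration: MAILER accepts `-s SUBJECT RECIPIENT` and
# reads the message body from stdin, as mail(1) does.
MAILER="${MAILER:-mail}"
REPORT_TO="${REPORT_TO:-email@example.com}"

# Run the given command, then mail its combined output with a subject
# that reflects whether it succeeded.
run_and_report() {
    # Capture stdout and stderr together, remembering the exit status.
    output="$("$@" 2>&1)" && status=0 || status=$?
    if [ "$status" -eq 0 ]; then
        subject="Gmvault report: OK"
    else
        subject="Gmvault report: FAILED (exit $status)"
    fi
    printf '%s\n' "$output" | "$MAILER" -s "$subject" "$REPORT_TO"
    return "$status"
}
```

With this in a wrapper script, a cron entry would run the sync through run_and_report instead of appending the pipe, and failed runs become easy to filter by subject.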
Since setting this up, I’ve had only one issue with the scheduled backups. Periodically, the synchronization will fail, with gmvault reporting an authentication error. Every time this has happened, the next run has succeeded. For this reason, I suspect that Google is rate-limiting my server’s access to its systems, as I’ve taken no action to address the supposed authentication error. As I noted earlier, I could change the frequency of the incremental backups, but the failures are too intermittent to warrant this just yet.
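Since the following run has always succeeded, another option would be to retry a failed sync after a pause rather than wait for the next scheduled run. The retry function below is a hypothetical sketch of that idea; the attempt count and delay are arbitrary, and it has not been checked against whatever limits Google actually applies.

```shell
#!/bin/sh
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed, sleeping
# DELAY_SECONDS between tries. Returns the last exit status on failure.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        "$@" && return 0
        status=$?
        if [ "$n" -ge "$attempts" ]; then
            return "$status"
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}
```

For example, retry 3 300 followed by the quick sync command would allow up to three attempts, five minutes apart, which still fits comfortably inside an hourly schedule.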
Image by Gualtiero Boffi, used with permission.