Deduping mutt with rmlint
May 9, 2020
bash
linux
mbsync
mutt
rmlint
I want to preface this by saying that any sort of command line mucking with your email should begin with a full backup. Part of the reason I moved to mutt was to make my local mailbox the source of truth for my email, which makes me explicitly responsible for backing it up.
cp -r .mail .mail_backup
Since the mailbox format is stored entirely in text files, and I only have about 6k emails, I just created a new git repo in my mailbox and added everything. Git is actually perfect for this because it’s designed to track changes to many small text files. After each step in this process I just made a new commit. A bonus of this is you can see what changes are made by various mbsync commands.
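A minimal sketch of the commit-after-every-step habit described above. The function name `snapshot_mailbox` and the throwaway commit identity are my own placeholders, not anything mutt or mbsync provides, and it assumes git is installed.

```shell
# Hypothetical helper: commit the current state of a mailbox tree to git.
# Usage: snapshot_mailbox <mailbox-dir> <commit message>
snapshot_mailbox() {
    # Create the repository on first use.
    [ -d "$1/.git" ] || git -C "$1" init --quiet
    # Stage additions, modifications and deletions alike.
    git -C "$1" add --all
    # The -c overrides are placeholders so the commit works even when no
    # git identity is configured; --allow-empty records a step that
    # changed nothing.
    git -C "$1" -c user.name=mailbox -c user.email=mailbox@localhost \
        commit --quiet --allow-empty -m "$2"
}
```

Calling something like `snapshot_mailbox ~/.mail "after mbsync -a"` after each step means `git diff --name-status HEAD~1 HEAD` shows which message files that step added, renamed, or removed.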
You have been warned.
Why I like mutt
So I’ve been using mutt as the primary email client on one of my systems for a few months now. It took a while to set up but I love the simplicity of it. I still use some other mail clients if I need to view an HTML email for some reason but plain text is great for most things.
Being a CLI app means it's extremely fast, I can navigate via keyboard, and I can access it remotely with ssh. It's also free from tracking pixels and HTML, which means when I get a phishing email the first thing I see is the phony URLs and not the legit-looking graphics. You can even set it up to let you edit the headers on outgoing messages by default, which is not super useful but is kind of a neat trick, and it means I can easily send messages from anything@mydomain.
What I was trying to do
I started out syncing my local mailbox from Fastmail using mbsync but only the INBOX folder, not the Spam/Archive/Trash etc. Once I got a bit more confident I decided to modify my workflow so when I archive a message locally, it gets moved to the Archive folder and removed from the Inbox, and then these changes get pushed back to the IMAP server.
I changed my .mbsyncrc from Sync Pull to Sync All, held my breath, and ran mbsync -a.
First problem: duplicate folders
It worked! Except somehow I ended up with some duplicate folders. Folders that only differ by capitalization, like sent/Sent, archive/Archive, etc. I think locally mutt created the lowercase ones and Fastmail created the capitalized ones and now I have messages in each.
My solution to this (pro tip: don’t do this) was to simply merge the folders.
mv sent/* Sent
rmdir sent
Problem solved (ha ha ha)
Second problem: duplicate UIDs
My guess is that mutt assigns identifiers that are only unique per folder. I started seeing warnings when syncing:
Maildir error: duplicate UID 1.
A short Google search later, I found a blog post explaining that the UIDs are stored in the name of the message file (cytokine is/was the name of my laptop), e.g.:
1588211305.311279_1450.cytokine,U=5065
The part at the end with U=5065 is the UID assigned by mutt. To fix duplicate UIDs (said the blog post) simply rename the file and remove the UID, and mutt will regenerate it.
I did this manually for the first duplicate UID, and then the second, and then I realized that mutt would only point out the first duplicate UID and then quit; it wouldn't give me a list of all of them.
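Since the warning only names one duplicate at a time, a small sketch like the following can list them all up front. It assumes the `,U=<number>` filename convention shown above; `list_duplicate_uids` is a made-up name.

```shell
# Hypothetical helper: print every UID that occurs more than once in one
# mail folder, based on the ",U=<number>" part of the message filenames.
# Usage: list_duplicate_uids <folder>
list_duplicate_uids() {
    # Look in both the cur and new subdirectories of the folder.
    find "$1/cur" "$1/new" -type f 2>/dev/null |
        # Keep only the number after ",U="; names without one are dropped.
        sed -n 's/.*,U=\([0-9][0-9]*\).*/\1/p' |
        sort -n |
        # uniq -d prints one line for each value that is repeated.
        uniq -d
}
```

Running `list_duplicate_uids ~/.mail/Sent` would print one clashing UID per line, or nothing if the folder is clean. It only reports the clash; it does not say whether the messages themselves are duplicates.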
Discovering the actual second problem through trial and error
A large number of duplicate UIDs was easy enough to fix (I thought, stupidly) using perl-rename: I’ll just strip all the UIDs.
perl-rename 's/,U=1\:.*//' */cur/*
Unfortunately what I didn’t realize was that at some point I had actually created duplicates of some messages, and that the duplicate UIDs were a symptom, not the root cause.
The fix
If you are trying to fix duplicate messages in a mailbox, start here.
There’s a great tool called rmlint (hopefully available from the repos of your Linux distro of choice - I use Void currently) that will scan a folder for duplicate files (based on hashing) and remove duplicates, leaving one copy of each.
Unfortunately it didn't work right away: the initial scan said every file had a unique hash. I knew this was not true, so I picked a specific message I'd found multiple copies of, from an order I placed with a beer delivery company.
$ for x in $(rg -l 'Dark Chocolate Porter' | xargs); do md5sum $x; done
b858c00d45b21eadcf96c186bd709680  INBOX/cur/1588032979.192567_5.cytokine,U=4102
d66e470eb0e2769d41bea62bac3e4511  INBOX/cur/1588109825.63404_2.cytokine,U=4112
38d5f16ec67048383fd3ab1c505cc379  Archive/cur/1588354986.675229_40.cytokine,U=8054
ef78b3f12ac5035a222d37048466887d  Archive/cur/1588354985.675229_31.cytokine,U=8045
I hashed the two copies of the same message and sure enough they produced different hashes. A quick diff showed why: an email header called X-TUID.
$ diff INBOX/cur/1588032979.192567_5.cytokine,U=4102 Archive/cur/1588354986.675229_40.cytokine,U=8054
180c180
< X-TUID: YTrvfoK18ZZk
---
> X-TUID: vgFviD4sjG3d
Checking a few more messages confirmed that this header was the only thing preventing these files from hashing to the same value.
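One way to confirm this without touching any files is to hash each message with that header filtered out. This is a rough sketch, assuming GNU coreutils; `hash_without_tuid` is a made-up name, and the filter would also drop a body line that happens to start with `X-TUID:`.

```shell
# Hypothetical helper: print "<md5>  <file>" for each message, hashing the
# content with lines that start with "X-TUID:" left out.
# The files themselves are not modified.
hash_without_tuid() {
    for f in "$@"; do
        # -a treats the file as text even if it contains odd bytes;
        # -v keeps every line that does NOT match the pattern.
        sum=$(grep -a -v '^X-TUID:' "$f" | md5sum | cut -d' ' -f1)
        printf '%s  %s\n' "$sum" "$f"
    done
}
```

Two copies of the same message should then print the same hash, for example with `hash_without_tuid INBOX/cur/* Archive/cur/* | sort`.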
I was able to remove the headers pretty easily with a sed command:
sed -i '/X-TUID/d' *
$ for x in $(rg -l 'Dark Chocolate Porter' | xargs); do md5sum $x; done
7b60541880d15ca6986002ee773d9bc9  INBOX/cur/1588032979.192567_5.cytokine,U=4102
90ce114b6cf5da32e7b2bffdb6d3d2d1  INBOX/cur/1588109825.63404_2.cytokine,U=4112
7b60541880d15ca6986002ee773d9bc9  Archive/cur/1588354986.675229_40.cytokine,U=8054
90ce114b6cf5da32e7b2bffdb6d3d2d1  Archive/cur/1588354985.675229_31.cytokine,U=8045
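The sed pattern above deletes any line containing X-TUID, including one in a message body. A slightly more cautious variant, sketched here on the assumption that GNU sed is available, only touches the header block (everything up to the first blank line) of files under cur/ and new/. `strip_tuid_headers` is a made-up name.

```shell
# Hypothetical helper: delete X-TUID header lines from every message under
# a mailbox tree, leaving message bodies alone. Requires GNU sed for -i.
# Usage: strip_tuid_headers <mailbox-dir>
strip_tuid_headers() {
    # Only visit message files, not the folder directories themselves.
    find "$1" -type f \( -path '*/cur/*' -o -path '*/new/*' \) \
        -exec sed -i '1,/^$/{/^X-TUID:/d}' {} +
    # 1,/^$/ limits the delete to lines from the start of each file up to
    # the first empty line, which is where the headers end.
}
```

As with everything else here, this is best run on a copy, e.g. `strip_tuid_headers ~/.mail_copy`.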
Now rmlint would behave correctly, but I wasn't sure how mbsync would behave without this header. Would I just cause more confusion? Then I realized that rmlint doesn't actually delete the files when you run the command; it generates a script to do it after the fact. This feature never made sense to me before but it was perfect for this use case.
What I did was:
- Make a copy of my mailbox
- Use sed on the copy to strip the X-TUID headers
- Run rmlint on the copy but don’t execute the script it produces yet
- Copy the script back to the original mailbox where the X-TUID headers were still intact and run it there
This worked well because the script just removes files by name, and the filenames weren't changed when I removed the X-TUID headers, so the script had no idea it wasn't deleting the same files that were compared to generate it.
==> Note: Please use the saved script below for removal, not the above output.
==> In total 6315 files, whereof 101 are duplicates in 86 groups.
==> This equals 8.24 MB of duplicates which could be removed.
==> 17 other suspicious item(s) found, which may vary in size.
==> Scanning took in total 6.138s.
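The copy-then-delete trick above can be sketched without rmlint at all. To be clear, this is a stand-in using md5sum, not what rmlint does internally, and both function names are invented. It assumes GNU coreutils and that the copy and the original have identical relative paths.

```shell
# Hypothetical stand-in for the rmlint scan: print the relative paths of
# message files under <dir> whose content matches an earlier file
# (earlier in sorted path order, so the first path of each group is kept).
list_duplicates() {
    ( cd "$1" &&
      find . -type f -path '*/cur/*' | sort |
      while IFS= read -r f; do
          printf '%s %s\n' "$(md5sum < "$f" | cut -d' ' -f1)" "$f"
      done |
      # Count each hash; once it has been seen, print only the path.
      awk 'seen[$1]++ { sub(/^[^ ]+ /, ""); print }' )
}

# Delete the relative paths read from stdin inside another tree.
# Usage: list_duplicates <stripped-copy> | remove_listed <original>
remove_listed() {
    while IFS= read -r rel; do
        rm -- "$1/$rel"
    done
}
```

The duplicates are found in the copy whose X-TUID headers were stripped, but the deletion happens in the original, so the surviving messages keep their headers. Saving the output of `list_duplicates` to a file and reading it before piping it to `remove_listed` gives the same review step that rmlint's generated script provides.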
I verified that this had actually worked (also using git, which again, is perfect for this) and then set mbsync to sync changes in both directions.
TL;DR:
The lessons I learned here: don't strip UIDs and let them regenerate until you're sure the messages themselves aren't duplicated, and there's a good reason rmlint defaults to not running the actual removal right away.
Also when experimenting with your mailbox, git is actually the perfect backup tool because it’s all plain text and you can see what changed in between revisions.
I still haven't figured out what X-TUID is for but I wouldn't be surprised if base64 decoding it yields the same number that's encoded in the filename.