Tuesday, November 09, 2010

A simple file-level dedupe utility in Python

At home, I've been working on organizing my photo library and found FastDup to be a great little utility. You point it at a directory and it finds duplicate files with surprising speed. It works well because it's smart about not doing more work than it needs to. A naive dedupe utility (which, ahem, I may have written in Java a couple years ago to do similar work with my audio library) works like this (a quick Python sketch follows the list):
  1. Compute the checksum of all files
  2. List files with matching checksums
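To make that concrete, here's a minimal Python sketch of the naive approach (this is illustrative, not my old Java tool; naive_dupes and the choice of MD5 are mine). Notice that it reads every byte of every file, even files that couldn't possibly have a duplicate:

    import hashlib
    import os
    import sys
    from collections import defaultdict

    def naive_dupes(root):
        """Checksum every file under root, then group matching checksums."""
        by_hash = defaultdict(list)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                h = hashlib.md5()
                with open(path, 'rb') as f:
                    # Read in 1MB chunks so huge files don't exhaust memory.
                    for chunk in iter(lambda: f.read(1 << 20), b''):
                        h.update(chunk)
                by_hash[h.hexdigest()].append(path)
        # Any checksum shared by two or more files is a duplicate group.
        return [paths for paths in by_hash.values() if len(paths) > 1]

    if __name__ == '__main__':
        for group in naive_dupes(sys.argv[1]):
            print('\n'.join(group) + '\n')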
A smarter approach, sketched after this list, is to:
  1. Group all files by size
  2. Do a partial comparison of all files of a given size, quickly excluding obvious non-matches
  3. Complete the comparison for files that look equivalent so far, listing matches
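Here's a rough Python sketch of that smarter strategy (qdupe's actual code differs in the details; smart_dupes, probe_bytes, and using MD5 for the full comparison are my assumptions):

    import hashlib
    import os
    from collections import defaultdict

    def smart_dupes(root, probe_bytes=4096):
        # Step 1: group files by size. A file with a unique size can't
        # have a duplicate, so most files drop out without being read.
        by_size = defaultdict(list)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    by_size[os.path.getsize(path)].append(path)
                except OSError:
                    pass  # skip files that vanish or can't be stat'd
        dupe_groups = []
        for paths in by_size.values():
            if len(paths) < 2:
                continue
            # Step 2: partial comparison -- hash just the first few KB
            # to cheaply exclude same-size files that differ early on.
            by_probe = defaultdict(list)
            for path in paths:
                with open(path, 'rb') as f:
                    key = hashlib.md5(f.read(probe_bytes)).hexdigest()
                by_probe[key].append(path)
            # Step 3: full comparison, but only for the survivors.
            for candidates in by_probe.values():
                if len(candidates) < 2:
                    continue
                by_full = defaultdict(list)
                for path in candidates:
                    h = hashlib.md5()
                    with open(path, 'rb') as f:
                        for chunk in iter(lambda: f.read(1 << 20), b''):
                            h.update(chunk)
                    by_full[h.hexdigest()].append(path)
                dupe_groups.extend(g for g in by_full.values() if len(g) > 1)
        return dupe_groups

The payoff is that the expensive full reads only happen for files that already match on size and on their first few kilobytes, which in a typical library is a tiny fraction of the bytes on disk.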
FastDup, which is written in C++, takes this approach. I compiled and ran it fine on my file server (an Ubuntu machine) and tried to compile it on my Mac, too, with no success. The author states in the README that it works on Linux and nowhere else, and the last release was a couple years ago, so it seemed I was out of luck.

Well, not really. I've been wanting to get re-acquainted with Python for a while now (for various reasons), and I figured this was a good excuse. How hard could it be? As it turns out, not very.

  • qdupe - A command-line utility to quickly find duplicate files, written in Python and inspired by FastDup.

So how does it compare to FastDup?

Out of curiosity, I ran both over my DVD library, which currently sits at about half a terabyte. I ran each twice, back to back, to see the effect of the OS's buffer cache. Both found 911 dupes, adding up to about 500MB. On the first (cold-cache) run, each took about a minute. On the second, FastDup took 3.0 seconds and qdupe took 3.6 seconds.