- Compute the checksum of all files
- List files with matching checksums
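A minimal sketch of that checksum approach might look like this (my own illustration, not code from either tool):

```python
import hashlib
import os
from collections import defaultdict

def find_dupes_by_checksum(root):
    """Hash every file under root and group paths by digest."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha1()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            groups[digest.hexdigest()].append(path)
    # Only groups with more than one path are duplicates.
    return [paths for paths in groups.values() if len(paths) > 1]
```

The drawback is that every byte of every file has to be read and hashed, even when a file's size alone rules out any possible match. A faster strategy defers reading as long as possible: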
- Group all files by size
- Do a partial comparison of all files of a given size, quickly excluding obvious non-matches
- Complete the comparison for files that look equivalent so far, listing matches
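Here's a rough Python sketch of that staged approach; the chunk sizes and the pairwise full comparison are my own choices for illustration, not necessarily how qdupe or FastDup implement it:

```python
import os
from collections import defaultdict
from itertools import combinations

PARTIAL_BYTES = 4096  # size of the quick "first chunk" check (arbitrary choice)

def walk_files(root):
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

def same_contents(a, b, chunk=1 << 20):
    """Full byte-by-byte comparison, only reached for likely matches."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ca, cb = fa.read(chunk), fb.read(chunk)
            if ca != cb:
                return False
            if not ca:
                return True

def find_dupes(root):
    # Stage 1: group by size; a file with a unique size cannot have a duplicate.
    by_size = defaultdict(list)
    for path in walk_files(root):
        by_size[os.path.getsize(path)].append(path)

    dupes = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # Stage 2: partial comparison on the first few KB to weed out non-matches.
        by_head = defaultdict(list)
        for path in paths:
            with open(path, "rb") as f:
                by_head[f.read(PARTIAL_BYTES)].append(path)
        # Stage 3: full comparison for files that still look equivalent.
        for candidates in by_head.values():
            for a, b in combinations(candidates, 2):
                if same_contents(a, b):
                    dupes.append((a, b))
    return dupes
```

Most files get eliminated by their size alone or by their first few kilobytes, so only genuine candidates are ever read in full.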
Well, not really. I've been wanting to get re-acquainted with Python for a while now (for various reasons), and I figured this was a good excuse. How hard could it be? As it turns out, not very.
- qdupe - A command-line utility to quickly find duplicate files, written in Python and inspired by FastDup.
So how does it compare to FastDup?
Out of curiosity, I ran both over my DVD library, currently about half a terabyte. I ran each twice, back to back, to see the effect of the OS's buffer cache. Both found 911 dupes, adding up to about 500 MB. On the first run, each took about a minute; on the second, FastDup took 3.0 seconds and qdupe took 3.6 seconds.