Free file deduplication utility available on GitHub
Thursday, April 12, 2012 at 01:02PM
I recently finished the Software as a Service (Part 1) software engineering class at www.coursera.org and got a certificate. I learned a lot about web development (like the Model-View-Controller design pattern), Ruby (RSpec, Cucumber, poetry mode, ...), and Agile methods (test-driven development, behavior-driven development, user stories, etc.). I also got a free GitHub account, which I can use to release my own code for people to use.
So, without further ado, here is the link to my GitHub account:
https://github.com/flowingriver
The first thing I put up there is a little Python command-line utility called deduper.py that finds files with duplicate contents under a given directory. It will find all potential duplicates, but it can also flag a few files that are not actually duplicates, so check the results before deleting anything.
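Since the utility can report files that are not really duplicates, one way to double-check a reported group before deleting anything is a byte-for-byte comparison. The sketch below is not part of deduper.py; confirmed_duplicates is a hypothetical helper name, and it assumes you already have a list of paths that were reported as one group.

```python
import filecmp


def confirmed_duplicates(paths):
    """Return groups of paths whose contents are byte-for-byte identical.

    Hypothetical helper for illustration; not taken from deduper.py.
    """
    groups = []
    for path in paths:
        for group in groups:
            # shallow=False forces a comparison of the actual file contents
            if filecmp.cmp(group[0], path, shallow=False):
                group.append(path)
                break
        else:
            # No existing group matched, so this file starts a new one
            groups.append([path])
    # A file on its own is not a duplicate of anything
    return [group for group in groups if len(group) > 1]
```

Anything that comes back in the same group is safe to treat as a true duplicate; files that drop out were false positives.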
I made this little script when I consolidated my files onto one central data storage partition. Linux has a better NTFS driver now, so dual-booting Windows and Linux is easier, since you can share data files on a common NTFS partition. I consolidated files from Linux and Windows onto such a partition, and I also copied over files from USB sticks and from my old laptop. I'd been collecting and copying files over my years in school, so I ended up with duplicate PDFs and other duplicate files on my NTFS partition. This utility helped me find the duplicates and delete them to save space and get more organized.
It has fully customizable command-line parameters, and by default it uses the SHA-256 cryptographic hash function to search for duplicate files. It hashes the concatenation of a customizable number of bytes from each file and the file length. Alternatively, it can use the MD5 hash function, but SHA-256 is more collision resistant, although MD5 is generally the faster of the two to compute. I've been learning all about this stuff from the Crypto class at coursera.org, so this utility was a good first application of some of this knowledge.
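To make the hashing idea concrete, here is a minimal sketch of the approach described above. It is not the actual code from deduper.py: the function names, the 4096-byte default, and the choice to append the file length as decimal text are all assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict


def partial_digest(path, num_bytes=4096, algorithm="sha256"):
    """Hash the first num_bytes of a file concatenated with its length.

    Illustrative only; the 4096-byte default is an assumption.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        head = handle.read(num_bytes)
    hasher = hashlib.new(algorithm)  # "sha256" by default, "md5" also works
    hasher.update(head)
    # Assumption: the length is appended as ASCII decimal digits
    hasher.update(str(size).encode("ascii"))
    return hasher.hexdigest()


def find_candidates(root, num_bytes=4096, algorithm="sha256"):
    """Group regular files under root that share the same partial digest."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                digest = partial_digest(path, num_bytes, algorithm)
                groups[digest].append(path)
    # Only digests shared by two or more files are potential duplicates
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Because only the first num_bytes of each file are read, two files with the same length and the same opening bytes land in the same group even if they differ later on. That is the trade-off for speed: identical files are never missed, but a group can contain files that are not true duplicates.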
I wrote all about it in the README.txt file. There's no need to download any special modules; it runs on Python 2.7 using only the standard library.
Download deduper.py here: https://github.com/flowingriver/Python-Utilities/blob/master/deduper.py
I hope to put some more miscellaneous stuff up on GitHub. There's a puzzle-solving algorithm I wrote a while ago that I'll put up there eventually. It solves a puzzle from a recently released game. More on this later. Stay tuned.