    Blog Index and Feed
    Thursday
    Apr 12, 2012

    Free file deduplication utility available on github

    I recently finished the Software as a Service (Part 1) software engineering class at www.coursera.org and got a certificate.  I learned a lot about web development (like the Model-View-Controller design pattern), Ruby (RSpec, Cucumber, poetry mode, ...), and Agile methods (test-driven design, behavior-driven development, user stories, etc.).  I also got a free GitHub account, which I can use to release my own code for people to use.

    So, without further ado, here is the link to my github account:

    https://github.com/flowingriver

    The first thing I put up there is a little Python command-line utility called deduper.py that finds files with duplicate contents under a directory.  It should flag every set of true duplicates, but because it hashes only part of each file along with the file length, it can also flag a few files that are not duplicates, so check the results before deleting anything.

    I made this little script when I consolidated files into one central data storage partition.  Linux has a better NTFS driver now, so dual-booting with Windows and Linux is made easier since you can share data files across a common NTFS partition.  I consolidated files from Linux and Windows on such a partition, and I had files on USB sticks from my old laptop that I put on there.  I'd been collecting files and copying files over the years in school, so now I had duplicate PDFs and duplicate other stuff on my NTFS partition.  This utility helped me find the duplicates and delete them to save space and get more organized.

    It has fully customizable command-line parameters, and by default it uses the SHA-256 cryptographic hash function to search for duplicate files.  It hashes the concatenation of a customizable number of bytes from each file and the file length.  Alternatively, it can use the MD5 hash function; SHA-256 is more collision resistant, though MD5 is usually the faster of the two.  I've been learning all about this stuff from the Crypto class from coursera.org, so this utility was a good first application of some of this knowledge.
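
    The idea described above can be sketched in a few lines of Python.  This is only an illustration and not the code in deduper.py: the function names, the 4096-byte default, and the choice to read the bytes from the start of each file are all assumptions made for the example, and it is written to run on current Python versions rather than matching the original script.

```python
import hashlib
import os
from collections import defaultdict


def fingerprint(path, num_bytes=4096, algorithm="sha256"):
    """Hash the first num_bytes of a file together with its length.

    Hypothetical helper: reading from the start of the file is an assumption.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        hasher.update(handle.read(num_bytes))
    # Mixing in the size separates files that share a prefix but differ in length.
    hasher.update(str(os.path.getsize(path)).encode("ascii"))
    return hasher.hexdigest()


def find_candidate_duplicates(root, num_bytes=4096, algorithm="sha256"):
    """Group regular files under root by fingerprint; keep groups of two or more."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                groups[fingerprint(path, num_bytes, algorithm)].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

    Calling find_candidate_duplicates("/some/directory") returns lists of paths that share a fingerprint.  Because only a prefix and the length are hashed, two files of equal size that differ only after the first num_bytes bytes land in the same group, which is why the groups are candidates to check rather than confirmed duplicates.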

    I wrote all about it in the README.txt file.  There's no need to download any special modules; it runs on plain Python 2.7 using only the standard library.

    Download deduper.py here: https://github.com/flowingriver/Python-Utilities/blob/master/deduper.py

    I hope to add some more miscellaneous stuff on github.  There's a puzzle-solving algorithm that I wrote a while ago that I'll put on there eventually.  It solves a puzzle from a recently released game.  More on this later.  Stay tuned. 

    Tuesday
    Jan 10, 2012

    Programming Languages In Retrospect

    The following are some of the things that I've done with programming languages in the past.

    Java
    data structures (lists, graphs, hashtables, sets, heaps, priority queue, queue)
    typical algorithms (Kruskal, Dijkstra, BFS, DFS, merge sort, quicksort, insertion sort)
    made modifications to Neighbor-Joining type algorithms (Computational Biology)
    database access
    file I/O
    XML parsing
    thread programming (computer science class)
    scheduler algorithms (computer science class)

    Python
    typical data types: dictionaries, lists, tuples, strings, ints, floats, etc.
    logic: for, if-then
    class hierarchy
    regular expression matching
    file I/O
    web page parsing 
    compilation algorithms (Earley parser, lexers, context-free grammars)
    numerical algorithms (numpy, optimization)
    Django (basic tutorial)
    wxPython (really really basic tutorial) 
    computational biology algorithms (protein knot-finding, gene searching, etc.)

    C/C++
    worked with machine structures and memory allocation in the class CS 61C 
    modified software called MJOIN to experiment on it 
    parallel computation with OpenMP at the UC Berkeley parallel computing boot camp 

    JavaScript
    created a compiler for JavaScript in Python with a context-free grammar for a computer science class

    XML, SQL, XPath, XQuery
    searches, sorts, access, inserts, deletes, views, updates, triggers, etc.

    Octave/Matlab
    basic data types and logic
    numerical linear algebra algorithms (Lanczos process, eigenvalue estimation, grad school class)
    worked with machine learning algorithms from a Stanford online class (e.g. linear regression, logistic regression, artificial neural networks, support vector machines)
    statistical modeling using singular value decomposition
    some optimization algorithms 

    Perl
    implemented the Smith-Waterman local sequence alignment algorithm 

    R
    implemented a simulated annealing algorithm, statistical calculations on phylogenetic trees, and a protein-protein docking algorithm 

    Maple
    computed phylogenetic invariants
    matrix algebra 

    MIPS

    TI-89 Basic (Hooray for High School)
    created calculation programs for pre-calculus and algebra such as the quadratic equation, parabola properties, conic section properties
    wrote and developed text based adventure games

    Tuesday
    Jan 10, 2012

    New Stanford Online Classes

    Stanford is offering several online classes once again starting late January.  There is one on probabilistic graphical models, which is a subject relevant to machine learning.  You can scroll down to the bottom of a class page to see other course offerings.  They are geared towards undergraduates, and this semester there are several interesting classes in cryptography, business, computer security, and econometric game theory among others.  They are unaccredited.  I intend to take or audit a few of these, time permitting.  The computer security class uses C++, and I hope to learn more about the language from it.  The probabilistic graphical models class is relevant to research that I've done in computational biology, and it should enhance my understanding of machine learning and statistics.

    Sunday
    Oct 16, 2011

    Fall Classes

    I'm taking Stanford Online classes in Computer Science this "fall".  I'm taking the database class (db-class.com), the machine learning class (ml-class.com), and the artificial intelligence (AI) class (ai-class.com).  These are all high quality classes, and I'm happy for the opportunity to take them.  With both machine learning and databases, I'm taking the advanced track with homeworks and assignments.  With the AI class, I'm just taking the lectures, the Basic version, since I need time to do other things.  AI is useful for stuff I might do, and there are many algorithms that I haven't seen before; however, machine learning is more relevant.  Learning some new algorithms from AI shouldn't be too challenging since I'm more of an algorithm/theory person anyhow.  Learning about databases in more depth will be helpful and is something that I might not do on my own.  So far, the machine learning class isn't very challenging since it reviewed ordinary least squares, which I already know about.  The database class, on the other hand, is challenging since I'm not an expert on SQL (Structured Query Language), and writing queries can be tricky.

    Wednesday
    Aug 17, 2011

    Website up

    My website went up around August 05, 2011.