
Checking for file duplicates with MD5sum

Posted: Mon Oct 06, 2008 3:01 pm
by grimdestripador
Recently I had a drive fail on me. As a last-ditch effort I used a dd copy with ddrescue to get whatever I could off the drive between failures.
I also had the contents of this disk copied to DVDs before it went bad. Some of the DVDs were lost/broken/scratched.


I have a mixture of files, most of which are identical copies.

Objective:
(1) Remove duplicate files

Tools I wish to use:
() Shell-based commands or scripts
() md5sum to create a hash of every file in a directory, exported with filename and md5sum
() A comparator to check for multiple entries of a hash;
   if a duplicate entry exists, move one entry to a to_be_deleted list.

So, first step: how do I have md5sum make a hash for each file?
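
(Just to show the kind of output I'm after: md5sum prints a 32-character hash followed by the filename, so something like this over the whole tree is roughly what I have in mind. Untested sketch; all-hashes.txt is just an example name.)

Code:

# Hash every regular file under the current directory and save the list;
# -print0 / -0 keeps filenames with spaces intact (assumes GNU find/xargs)
find . -type f -print0 | xargs -0 md5sum > all-hashes.txt

# Quick check: which hashes occur more than once?
cut -c1-32 all-hashes.txt | sort | uniq -d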

Re: Checking for file duplicates with MD5sum

Posted: Mon Oct 06, 2008 3:08 pm
by grimdestripador
http://elonen.iki.fi/code/misc-notes/re ... ate-files/

and so I found this... any help adjusting it?

Code:

OUTF=rem-duplicates.sh;
echo "#! /bin/sh" > $OUTF;
find "$@" -type f -print0 |
  xargs -0 -n1 md5sum |
    sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
    sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
chmod a+x $OUTF; ls -l $OUTF
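
As far as I can tell, that script hashes every file under the directories you pass it, keeps only the checksums that occur more than once, and writes a rem-duplicates.sh full of commented-out rm lines, so nothing is deleted until you review it. If I'm reading it right, you would drive it roughly like this (untested; find-dups.sh and the directory names are just placeholders):

Code:

# Save the block above as find-dups.sh, then generate the removal
# script for one or more directory trees
sh find-dups.sh ~/recovered ~/from-dvds

# Review rem-duplicates.sh, uncomment the "#rm" lines for the copies
# you want gone (leave one copy per group), then run it
./rem-duplicates.sh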

Re: Checking for file duplicates with MD5sum

Posted: Mon Oct 06, 2008 3:12 pm
by grimdestripador
I also found a one-liner that uses md5sum to print the duplicates:

Code:

find . ! -empty -type f -printf "%s " -exec ls -dQ {} \; | sort -n | uniq -D -w 1 | cut -d" " -f2- | xargs md5sum | sort | uniq -w32 -d --all-repeated=separate | cut -c35-

Or, from the Ubuntu forums. That script has another problem, though: it doesn't work when it finds a filename containing an apostrophe (').
I've improved my duplicate-file-finding "one-liner" (just about) so that it is thousands of times faster. The obvious flaw with the original was that it was checksumming every single file to find the duplicates; the easy optimisation is to only consider processing files which have the same size...

Code:

find . ! -empty -type f -printf "%s " -exec ls -dQ {} \; | sort -n | uniq -D -w 1 | \
cut -d" " -f2- | \
xargs md5sum | sort | \
uniq -w32 -d --all-repeated=separate | \
cut -c35-

If anyone's interested, here's each step of the pipeline explained:

Code:

# Print the size and filename of each file found on the path,
# sort using the file size as the key, then use uniq to
# leave only files with the same size in the pipeline
find . ! -empty -type f -printf "%s '%p'\n" | sort -n | uniq -D -w 1 | \

# Trim off the file size in preparation for next stage
cut -d" " -f2- | \

# Create the checksum for the files of the same size and then sort
xargs md5sum | sort | \

# Strip out any checksums that are unique, leaving only the duplicates
uniq -w32 -d --all-repeated=separate | \

# Strips out the checksum part, just leaving the duplicate filenames
cut -c35-

I've tried it a couple of times, and I think it works ;)
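
In case the magic numbers look arbitrary: md5sum prints a 32-character hex digest, then two spaces, then the filename, so uniq -w32 groups lines by checksum and cut -c35- keeps just the filename (which starts at column 35). For example (filenames and hash made up):

Code:

$ md5sum holiday.jpg holiday-copy.jpg
3858f62230ac3c915f300c664312c63f  holiday.jpg
3858f62230ac3c915f300c664312c63f  holiday-copy.jpg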

You might want to give a size argument to the first find to only report files bigger than a certain size (e.g. 1 megabyte):

Code:

find . -size +1M -type f -printf "%s '%p'\n" .....
or

JUniq - Duplicate file remover:
JUniq generates a shell script with all the files to be deleted.

java -jar /<path to>/juniq.jar
http://www.blisstonia.com/software/JUniq/

Unix shell script for removing duplicate files:
This script generates a second script with all the files to be deleted. It is the fastest method to run, but you are required to manually pick which files to delete, so in practice it is actually the slowest. It would probably be a minor code change to pre-select all the duplicate files (see the sketch after the link). This would also make a good Nautilus script!
http://elonen.iki.fi/code/misc-notes/re ... ate-files/
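
If anyone wants that pre-selection, here is an untested sketch based on the script quoted earlier in this thread: the sed no longer comments everything out, and an awk step keeps the first copy in each duplicate group as a commented-out line while turning the remaining copies into live rm lines. Review the generated rem-duplicates.sh before running it.

Code:

OUTF=rem-duplicates.sh;
echo "#! /bin/sh" > $OUTF;
find "$@" -type f -print0 |
  xargs -0 -n1 md5sum |
    sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
    # strip the checksum and escape shell-special characters, as before
    sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g' |
    # first line of each group stays commented (kept), the rest become rm lines
    awk 'BEGIN { first = 1 }
         /^$/  { first = 1; print; next }
         first { print "#rm " $0; first = 0; next }
               { print "rm " $0 }' >> $OUTF;
chmod a+x $OUTF; ls -l $OUTF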

DupFinder.exe:
This utility gets the job done, if you can map a drive from an Ubuntu machine onto a Windows machine.
http://support.microsoft.com/default.as ... 6121121120