Checking for file duplicates with MD5sum

Post by grimdestripador »

Recently I had a drive fail on me. As a last-ditch effort I used ddrescue to make a dd-style copy and pull whatever I could off the drive between failures.
I also had the contents of this disk copied to DVDs before it went bad, but some of those DVDs were lost, broken, or scratched.

I now have a mixture of files, most of which are identical copies.

Objective:
(1) Remove duplicate files

Tools I wish to use:
(1) Shell-based commands or scripts
(2) md5sum, to create a hash of every file in a directory, exported as filename plus md5sum
(3) A comparator, to check for multiple entries of a hash; if a duplicate entry exists, move one entry to a to_be_deleted list

So, first step: how do I have md5sum make a hash of each file?
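
A minimal sketch of that first step, assuming GNU find plus coreutils md5sum and awk (hashes.txt and to_be_deleted.txt are just placeholder names). Nothing is deleted here; the list is only for review.

Code:

# Hash every regular file under the current directory; md5sum prints
# "<hash>  <filename>" for each one.
find . -type f -exec md5sum {} + > hashes.txt

# Sort by hash so identical files sit next to each other, then keep the first
# file of each hash group and send the remaining copies to a review list.
sort hashes.txt |
  awk 'seen[$1]++ { line = $0; sub(/^[0-9a-f]+[ *]+/, "", line); print line }' \
  > to_be_deleted.txt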

Re: Checking for file duplicates with MD5sum

Post by grimdestripador »

http://elonen.iki.fi/code/misc-notes/re ... ate-files/

So I found this... any help adjusting it?

Code:

OUTF=rem-duplicates.sh;
echo "#! /bin/sh" > $OUTF;
find "$@" -type f -print0 |
  xargs -0 -n1 md5sum |
    sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
    sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
chmod a+x $OUTF; ls -l $OUTF
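
For what it's worth, the generated rem-duplicates.sh as produced above contains one blank-line-separated group per set of identical files, with every rm commented out, so nothing is deleted until you edit it yourself. A possible workflow:

Code:

# Review the script: each group of identical files is a blank-line-separated
# block of "#rm <file>" lines.
less rem-duplicates.sh

# Uncomment the rm lines for the copies you want gone (keep at least one file
# per group!), then run it.
./rem-duplicates.sh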

Re: Checking for file duplicates with MD5sum

Post by grimdestripador »

I also found a one-liner that lists the duplicate files by comparing md5sums:

Code:

find . ! -empty -type f -printf "%s " -exec ls -dQ {} \; | sort -n | uniq -D -w 1 | cut -d" " -f2- | xargs md5sum | sort | uniq -w32 -d --all-repeated=separate | cut -c35-

Or, from the Ubuntu forums (this script has another problem, though: it doesn't work when it finds a filename containing an apostrophe):

I've improved my duplicate-file-finding "one-liner" (just about) so that it is thousands of times faster. The obvious flaw with the original was that it was checksumming every single file to find the duplicates. The easy optimisation is to only consider checksumming files which have the same size...

Code:

find . ! -empty -type f -printf "%s " -exec ls -dQ {} \; | sort -n | uniq -D -w 1 | \
cut -d" " -f2- | \
xargs md5sum | sort | \
uniq -w32 -d --all-repeated=separate | \
cut -c35-
If anyone's interested, here's each step of the pipeline explained:

Code:

# Prints out the size and filename of each file found on the path.
# and sort using the filesize as the key, then using uniq to
# only leave filenames with the same size in the pipeline
find . ! -empty -type f -printf "%s '%p'\n" | sort -n | uniq -D -w 1 | \

# Trim off the file size in preparation for next stage
cut -d" " -f2- | \

# Create the checksum for the files of the same size and then sort
xargs md5sum | sort | \

# Strip out any checksums that are unique, leaving only the duplicates
uniq -w32 -d --all-repeated=separate | \

# Strips out the checksum part, just leaving the duplicate filenames
cut -c35-
I've tried it a couple of times, and I think it works ;)

You might want to give a size argument to the first find to only report files bigger than a certain size (e.g. 1 megabyte):

Code:

find . -size +1M -type f -printf "%s '%p'\n" .....
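
Since the quoted pipeline still trips over apostrophes and other shell-special characters in filenames, here is a hedged variant in the same spirit (assuming GNU find, xargs, sort and uniq) that keeps the size pre-filter but passes filenames NUL-separated, so quoting never becomes an issue:

Code:

# 1. List the sizes of all non-empty regular files and keep only the sizes
#    that occur more than once.
# 2. For each such size, find the files of exactly that many bytes (-size Nc)
#    and print their names NUL-terminated.
# 3. Hash those candidates and keep only groups with identical checksums.
find . ! -empty -type f -printf "%s\n" | sort -n | uniq -d |
while read -r size; do
    find . -type f -size "${size}c" -print0
done |
xargs -0 md5sum | sort | uniq -w32 -d --all-repeated=separate | cut -c35-

It rescans the tree once per duplicated size, so on a huge directory the single-pass pipelines above may still be quicker; the point is only that filenames never pass through the shell unquoted.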
Or, some ready-made alternatives:

JUniq - Duplicate file remover:
JUniq generates a shell script with all the files to be deleted.

java -jar /<path to>/juniq.jar
http://www.blisstonia.com/software/JUniq/

Unix shell script for removing duplicate files:
This script generates a second script listing all the files to be deleted. It runs the fastest, but since you are required to manually pick which files to delete, it actually ends up being the slowest in practice. It is probably a minor code change to pre-select all the duplicate files (see the sketch just below). This would also make a good Nautilus script!
http://elonen.iki.fi/code/misc-notes/re ... ate-files/
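
On that "minor code change": if the generated rem-duplicates.sh looks like the one produced earlier in this thread (a "#! /bin/sh" header, then blank-line-separated groups of "#rm <file>" lines), one hedged way to pre-select the duplicates is to uncomment every entry except the first in each group. The rem-duplicates-auto.sh name is just a placeholder.

Code:

# Keep the header and the blank lines that separate groups; leave the first
# "#rm" of each group commented out (that copy survives) and activate the rest.
awk 'NR == 1 || /^$/ { print; keep = 1; next }       # header or group separator
     keep            { print; keep = 0; next }       # first file in the group stays
                     { sub(/^#rm /, "rm "); print }  # later files: enable the rm
    ' rem-duplicates.sh > rem-duplicates-auto.sh
chmod +x rem-duplicates-auto.sh

The "first entry wins" rule is arbitrary, so it is still worth skimming rem-duplicates-auto.sh before running it.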

DupFinder.exe:
This utility gets the job done, if you can map a drive from an Ubuntu machine onto a Windows machine.
http://support.microsoft.com/default.as ... 6121121120