Script to find duplicate files within one or more directories
Hi, has anyone got a script which does more or less the following, please:
I have two directories with roughly 1500 photos in each. I know that some of the photos are the same even though they have different timestamps and names. To avoid a laborious visual comparison, what I would like to do is run a script (bash or other) which lists the files that are identical. There may be duplicates with differing names within each directory, and/or duplicates between the two directories. In theory one should be able to pass $1 $2 $3... as directories to compare; it need not be limited to one or two directories. Also, while I am talking about image files here, I would like the program to be generic, able to compare any type of file in the directories being compared, and I would guess that if an md5 hash signature is used then the type of file is immaterial (please tell me if I'm talking nonsense, I won't be offended).
I was thinking a script could do this by creating an md5 or other hash checksum of the files in the directories and then comparing each file against the stored checksums, producing a list of files which share the same md5 value and hence should be identical. Perhaps someone knows of existing tools within the unix/linux suite, such as the various shells, or awk, perl, php, python etc., which I am not aware of.
If someone knows a program or script I could run under w32 (WXP, say), that would be useful too, as I can perform the task on either system and then move the files across if necessary. In any case, since I use both environments, it would be useful to know how to do it in both.
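For what it's worth, the md5 approach you describe can be sketched on Linux with GNU coreutils alone, no scripting language required. This is only a sketch (dir1 and dir2 are placeholders for your photo directories, and the uniq options are GNU-specific):

```shell
# Hash every regular file under the named directories, sort by digest, then
# print groups of lines whose first 32 characters (the md5 hex digest) repeat.
# uniq -w32 compares only the digest; --all-repeated=separate prints each
# group of duplicates separated by a blank line (GNU coreutils options).
find dir1 dir2 -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate
```

Because the digest is compared rather than the filename, differing names and timestamps don't matter, only the file contents.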
If you need something that also runs on Windows, you can do essentially the same with Ruby, Python or Perl - whichever language you know best.
Also, note that there's a slight twist to using hashes for this: If the images were not just copied between the directories, but re-encoded, they can have different md5sums even though they look the same. If that's the case, you could try GQview, which has a "find duplicates" feature (that I've never tested).
Oh well, I'm trying to learn Ruby anyway and this looks like a nice exercise.
Code:
require 'digest/md5'

# Hash every file in the first directory...
list = {}
Dir.foreach(ARGV[0]) do |file|
  path = File.join(ARGV[0], file)
  next unless File.file? path
  list[Digest::MD5.digest(File.read(path))] = path
end

# ...then report each file in the second directory with a matching digest.
Dir.foreach(ARGV[1]) do |file|
  path = File.join(ARGV[1], file)
  next unless File.file? path
  if otherpath = list[Digest::MD5.digest(File.read(path))]
    puts path + "\t" + otherpath
  end
end
Here is a slightly more complicated script. What it gives you in exchange for the complexity is that it is much less resource-intensive than the quick hack above. Note: I wrote this on FreeBSD, so things might be a little off elsewhere. You may want to check that the -ls option to find returns the size in the 7th field and the filename in the 11th, and modify those values if they don't.
This script checks all subdirectories below the paths you give it. It also only checksums files which have matching sizes, which will very likely reduce the load tremendously, as you're not hashing every file.
Code:
#!/bin/sh
#md5prog="md5 -r"
md5prog="md5sum"

old=-1
count=0
files=""

# Checksum the candidate files and print groups whose digests match.
docompare() {
    ${md5prog} ${*} | sort | awk '
        BEGIN { prevsum=0; dup=0; }
        {
            if (prevsum == $1) {
                printf "%s ", previous;
                dup=1;
            } else {
                if (dup == 1) printf "%s\n\n", previous;
                dup=0;
            }
            prevsum=$1;
            previous=$2;
        }
        END { if (dup == 1) printf "%s\n\n", previous; }'
}

# Collect runs of files with equal sizes; only those get checksummed.
mainloop() {
    new=${1}
    if [ ${new} -eq ${old} ]; then
        count=`echo "${count}+1" | bc`;
        files="${files} ${2}";
    else
        if [ ${count} -gt 1 ]; then
            docompare ${files};
        fi
        count=1;
        files="${2}";
        old=${new};
    fi
}

if [ x"${1}" = "x" ]; then
    echo "Usage: `basename ${0}` {path1} [path2 ...]";
    exit 1;
fi

find ${*} -ls | awk '{print $7 " " $11}' | sort -n | {
    while read LINE; do
        mainloop $LINE
    done
    # Don't forget the final size group once input runs out.
    if [ ${count} -gt 1 ]; then
        docompare ${files};
    fi
}
Edit: A couple of lines above might look odd to more experienced Bash programmers because this script is /bin/sh compatible, so I couldn't use some of the neat little tricks you might know about.
After reading frob23's post, I also came up with the following one. It's the fastest solution I've tested so far: it also compares sizes first, does a recursive search of an arbitrary number of directories, and adds some error checking:
Code:
#!/usr/bin/ruby
require 'digest/md5'

(puts "Usage: #{$0} {dir1} [dir2 ...]"; exit) if ARGV.empty?

# Group all files by size first; only same-sized files can be identical.
sizes = {}; prev = []; dup = false
Dir[File.join("{#{ARGV.join(',')}}", "**", "*")].select { |f| File.file?(f) }.
  each { |f| if sizes[size = File.size(f)] then sizes[size].push f else sizes[size] = [f] end }

# Within each size group, hash the contents and print runs of equal digests.
sizes.each do |size, files|
  next if files.length == 1
  files.map { |f| [Digest::MD5.digest(File.read(f)), f] }.sort_by { |p| p[0] }.each do |p|
    if p[0] == prev[0]
      prev[1] += "\t" + p[1]
      dup = true
    else
      puts prev[1] if dup
      dup = false
      prev = p
    end
  end
end
# Flush the last run if the input ended on a duplicate.
puts prev[1] if dup
Just out of curiosity, why not just say #!/bin/bash rather than #!/bin/sh, and then use the tricks?
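To illustrate what those tricks buy you, here is a rough bash-only sketch of the same idea (assuming bash 4+ for associative arrays; find_dupes is a made-up function name, not from the scripts above). The associative array keyed on the digest replaces the sort-and-compare pass entirely, and process substitution keeps the loop out of a subshell:

```shell
#!/bin/bash
# Bash-only sketch: an associative array maps digest -> first path seen,
# so each later file with the same digest is reported immediately.
find_dupes() {
    local f sum
    declare -A seen
    while IFS= read -r -d '' f; do
        sum=$(md5sum "$f" | awk '{print $1}')
        if [[ -n ${seen[$sum]} ]]; then
            printf '%s\t%s\n' "$f" "${seen[$sum]}"   # duplicate pair
        else
            seen[$sum]=$f
        fi
    done < <(find "$@" -type f -print0)   # -print0/-d '' keeps odd filenames safe
}
```

None of that (declare -A, [[ ]], local, process substitution) is available in a plain /bin/sh, which is presumably why the script above sticks to temporary variables and bc.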