Find duplicate files by name

xzased · 07-24-2010, 12:14 AM

Hi, we have a huge number of duplicate files in a folder, and I would like some pointers on writing a bash script that creates a list of them. I've seen examples that compare the md5 sums of files, but I don't need that; the file name is enough. Can someone please help me?

Telengard · 07-24-2010, 12:29 AM

Quote:

Originally Posted by xzased

Hi, we have a huge number of duplicate files in a folder, and I would like some pointers on writing a bash script that creates a list of them ... the file name is enough.

Umm, it should not be possible to have two files with the same name in the same folder.

xzased · 07-24-2010, 12:59 AM

LOL! True, Mr. Telengard. I meant in subfolders. I have the directory /storage, which holds about 10 subfolders, each of which holds around 3 more subfolders with 300+ files apiece. Messy, I know. So the duplicates are between subfolders.

Telengard · 07-24-2010, 01:06 AM

Okay, this has had only very minimal testing so use at your own discretion. As always, it is your responsibility to evaluate this code's suitability for your purposes.

Code:

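#!/bin/bash

# List every regular file below the current directory.
find -type f > names.lst

while read name
do
    bn="$( basename "$name" )"
    # Other listed paths that contain the same base name.
    name2="$( grep "$bn" names.lst | grep -v "$name" )"
    if [ "$name2" != "" ]
    then
        echo "$name"
        echo "$name2"
    fi
done < names.lst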

I bet 3 Internets that someone else will have a much more elegant solution for you by tomorrow.
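
One possibly shorter route, offered as a sketch rather than a tested replacement: let find print only the base names, then let sort and uniq -d report the names that occur more than once. This assumes GNU find (for -printf) and file names without embedded newlines. The function name list_dup_names is made up for illustration.

```shell
#!/bin/bash
# Sketch: print each base name that occurs more than once below a directory.
# Assumes GNU find (-printf) and file names without embedded newlines.

# list_dup_names is a hypothetical helper name, not part of the script above.
list_dup_names() {
    local dir="${1:-.}"
    # %f is the base name only; uniq -d prints one copy of each repeated line.
    find "$dir" -type f -printf '%f\n' | sort | uniq -d
}
```

For example, `list_dup_names /storage` would print one line per duplicated name. It only reports the names, not where the copies live.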

xzased · 07-24-2010, 01:35 AM

Wow, thanks sir. Your help is appreciated.

gabolander · 10-19-2012, 04:54 AM

Hi Telengard, your script is good, but there's a small problem: every duplicated filename is displayed twice (obviously, since each file with a duplicated basename is matched once per copy when grepping the same list).
Starting from your script (thanks!), I made a few small changes so that each pair of duplicates is displayed only once. My variant also shows the file sizes.
Hoping this helps somebody else, I paste the code below:

Code:

#!/bin/bash

# List every regular file below the current directory.
find -type f > names.lst

# Start with an empty output file.
> names.out

TAB=`echo -ne "\t"`
while read name
do
    bn="$( basename "$name" )"
    # Other listed paths that contain the same base name.
    name2="$( grep "$bn" names.lst | grep -v "$name" )"
    if [ "$name2" != "" ]
    then
        # Skip files already recorded, so each pair is listed only once.
        if !(grep -q "^${name}${TAB}" names.out); then
            size=`stat --format=%s "$name"`
            size2=`stat --format=%s "$name2"`
            echo -e "$name${TAB}$size" >> names.out
            echo -e "$name2${TAB}$size2" >> names.out
        fi
    fi
done < names.lst

cat names.out
rm -f names.lst
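
When a name has more than two copies, the lookup in the loop returns several paths at once. A one-pass variant, sketched here under the assumption of GNU find and a POSIX awk, and of file names without tabs or newlines, groups the paths by base name and prints path and size for every group with more than one member. The function name list_dup_paths is hypothetical.

```shell
#!/bin/bash
# Sketch: print "path<TAB>size" for every file whose base name occurs more than once.
# Assumes GNU find (-printf) and file names without tabs or newlines.

# list_dup_paths is a hypothetical helper name.
list_dup_paths() {
    local dir="${1:-.}"
    # Emit "basename<TAB>size<TAB>path" per file; a byte-wise sort keeps
    # lines with the same base name next to each other.
    find "$dir" -type f -printf '%f\t%s\t%p\n' | LC_ALL=C sort |
    awk -F'\t' '
        # Print the collected group only if it had more than one member.
        function emit() { if (n > 1) printf "%s", group; group = ""; n = 0 }
        $1 != prev { emit(); prev = $1 }
        { group = group $3 "\t" $2 "\n"; n++ }
        END { emit() }
    '
}
```

Called as `list_dup_paths /storage`, it prints nothing for names that occur only once, and one line per copy otherwise.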

nt4boy · 11-30-2012, 08:03 AM

Guys,

Telengard's script is definitely the business, found just what I was looking for, but I am afraid I cannot get gabolander's revision to run.

My CentOS 5.8 seems to have an issue with grep in this line:

if !(grep -q "^${name}${TAB}" names.out); then

At least, I am getting a repeated grep error, and since the rest of the syntax is unchanged from the original script, I surmise that it's the line I've pasted.

I'd appreciate some advice on this please.

Thanks

gabolander · 11-30-2012, 08:30 AM

This is really very weird. I just tested my script on an updated CentOS 5.8:

Code:

[root@srv-rti /tmp/test]# lsb_release -a
LSB Version:    :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: CentOS
Description:    CentOS release 5.8 (Final)
Release:        5.8
Codename:       Final

[root@srv-rti /tmp/test]# uname -a
Linux srv-rti.comune.rimini.it 2.6.18-308.16.1.el5 #1 SMP Tue Oct 2 22:01:43 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

And the line with "if !(grep -q ...)" is a standard way of using grep in Bash scripts.

Are you sure you put "#!/bin/bash" on the first line of the script? (I would not want it to run under plain sh, which may not support the same syntax as Bash's extensions.)

This is the result of running my script on two dup files in two subdirectories:

Code:

[root@srv-rti /tmp/test]# find_dups 
./a/hobbitclient.cfg    1612
./b/hobbitclient.cfg    1612
./a/prova       423
./b/prova       423

Cheers,
G.

nt4boy · 11-30-2012, 08:57 AM

This is pasted from my script; I fully expect you to point out my foolishness instantly!

#!/bin/bash

find -type f > names.lst

> names.out

TAB='echo -ne "\t"'
while read name
do
bn="$( basename "$name" )"
name2="$( grep "$bn" names.lst | grep -v "$name" )"
if [ "$name2" != "" ]
then
if !(grep -q "^${name}${TAB}" names.out); then
size='stat --format=%s "$name"'
size2='stat --format=%s "$name2"'
echo -e "$name${TAB}$size" >> names.out
echo -e "$name2${TAB}$size" >> names.out
fi
fi
done < names.lst

cat names.out
# rm -f names.lst
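
One thing worth checking in a paste like this is the quote characters: the earlier script wrapped its commands in backticks, while the lines above use single quotes. In Bash the two behave differently, as this small sketch shows (the variable names are illustrative only).

```shell
#!/bin/bash
# Sketch: single quotes store the text itself; backticks or $( ) run the command.

literal='printf "\t"'         # the command text, stored as a plain string
with_ticks=`printf "\t"`      # the command's output: one tab character
with_dollar="$(printf '\t')"  # same result, written with $( )

printf 'literal length:     %d\n' "${#literal}"
printf 'with_ticks length:  %d\n' "${#with_ticks}"
printf 'with_dollar length: %d\n' "${#with_dollar}"
```

With single quotes, a variable such as TAB or size ends up holding the command text rather than a tab or a file size.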

nt4boy · 11-30-2012, 09:27 AM

Right, I have a better result now.
I cut and pasted the original code into Windows Notepad, got into a state with line breaks, and learned about dos2unix. Anyway, I have now got it into Linux precisely, and it does run.

Sorry to mess you about.

However, as I scan the terminal window I see messages saying that it was unable to stat such-and-such a file, that no such folder exists, and that the file name is too long. Maybe that's correct if there is no duplicate?

So, if there are no duplicates and I have a series of subfolders with thousands of files, it would be even better if the lines where there are no duplicates were not written to the output file.

The duplicates do, however, end up with their stats on the following lines.

Thanks
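
For anyone hitting the same line-break trouble, here is a sketch of how Windows line endings can be spotted and removed when dos2unix is not at hand. The helper names has_crlf and strip_cr are hypothetical, and the script file to clean is passed as an argument.

```shell
#!/bin/bash
# Sketch: detect and strip carriage returns left behind by a Windows editor.

# has_crlf and strip_cr are hypothetical helper names.
has_crlf() {
    # Succeeds if any line of the file contains a carriage return.
    grep -q $'\r' "$1"
}

strip_cr() {
    # Write a copy without carriage returns, then overwrite the original
    # in place so its permissions are kept.
    local tmp status
    tmp="$(mktemp)" || return 1
    tr -d '\r' < "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

A typical use would be `has_crlf myscript.sh && strip_cr myscript.sh`, where myscript.sh stands for whatever file was pasted from Windows.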

nt4boy · 12-05-2012, 06:31 AM

All,

Just to round this off: I failed to get this to work on my setup, so I looked around some more, and in the end http://www.perlmonks.org/?node_id=855401 provided me with exactly what I needed.

I had to edit the code from the first post there, following the suggestions of the later posters, but here is what worked for me:

#!/usr/bin/perl
use strict;
use warnings;

use File::Compare;
use File::Find;
use File::Spec;

#If you want to set a base_directory, you can do so here.

my $base_directory;

print "What directory? ";
my $directory = <>;
chomp $directory;

#Maps "filename (size bytes)" to the list of paths that share that name and size.
my %files;
sub files_wanted {
    my $raw_file = $File::Find::name;
    if ( -f ) {
        my ($volume,$directories,$file) = File::Spec->splitpath($raw_file);
        #update from a prior suggestion.
        #find() has changed into the file's directory, so test $_ rather than the full path.
        my $file_size = -s $_;
        push @{$files{"$file ($file_size bytes)"}}, $raw_file;
    }
}
#If you set a base directory above, you will need to change

find(\&files_wanted,$directory);

#Append the report to dupes.txt in the current directory.
open (my $out, '>>', 'dupes.txt') or die "Cannot open dupes.txt: $!";

#This section searches the hash for any file with 2 or more files which share the same filename.ext and size.
#After that, it compares all of the files with those attributes to determine if they share the same contents.
#It will print the list of files with the same filename and size and will tell you which ones share the same
#contents.

for my $file (sort keys %files) {
    if (@{$files{$file}} > 1) {
        my $amount = @{$files{$file}};
        print {$out} "$file\t\t$amount\n";
        for my $location1 (@{$files{$file}}) {
            print {$out} "\t$location1\n";
            for my $location2 (@{$files{$file}}) {
                unless ($location1 eq $location2) {
                    if (compare($location1,$location2) == 0) {
                        print {$out} "\t\tExact copy: $location2\n";
                    }
                }
            }
        }
        print {$out} "\n";
    }
}
close ($out);