[SOLVED] How to find duplicate files and delete all except most recent version

anon091 · 08-17-2010, 08:18 AM

I have a directory containing a ton of photos, some of which are duplicates with different names. Is there any way in Linux to find all the duplicates and remove all of them except the most recent version? I know on Windows there are utilities that will do this through a GUI, but I'm using Linux through the CLI only.

kilgoretrout · 08-17-2010, 08:53 AM

There are some bash scripts out there that do that. Here's one that I found ages ago called dupimage:

Code:

#!/bin/bash

CWD=`pwd`
SORTING=/tmp/sorting
OUTPUT=/tmp/filesfound
DELETE=/tmp/delete
DUPLICATE_DIR=~/duplicates
COUNT=0
F_COUNT=0
DIR_COUNT=0
EXT_COUNT=1

##################################################################
	if [ ! -d $DUPLICATE_DIR ]; then
		mkdir -p $DUPLICATE_DIR
	fi

##################################################################
# remove any previous output files
	rm -rf $OUTPUT
	rm -rf $SORTING
	rm -rf $DELETE

##################################################################
echo
echo "Duplicate Image Finder"
echo
echo "Press enter for current directory"
echo "Or enter directory path to scan: "
read ANSWER
if [ "$ANSWER" == "" ]; then
	ANSWER="$CWD"
fi

##################################################################

# find images
find "$ANSWER" -type f -name '*.[Jj][Pp][Gg]' >> $SORTING

IMAGES_TO_FIND=`cat $SORTING`
	for x in $IMAGES_TO_FIND; do 	# generate an md5sum for each file found and add it to the output file (paths containing whitespace are split here and will fail)
		COUNT=$(($COUNT + 1 ))
		MD5SUM=`md5sum "$x" | awk '{print $1}'`
		echo "$MD5SUM $x" >> $OUTPUT
	done

##################################################################

# find duplicates in the output file; every entry of a duplicate set except the last one is listed for moving
cat $OUTPUT | sort | uniq -w 32 -d --all-repeated=separate | sed -n '/^$/{p;h;};/./{x;/./p;}' | awk '{print $2}' >> $DELETE

FILES_TO_DELETE=`cat $DELETE`
	for FILE in $FILES_TO_DELETE; do
		NAME=`basename $FILE`
		F_COUNT=$(($F_COUNT + 1 ))
			if [ ! -e "$DUPLICATE_DIR/$NAME" ]; then # check to see if the file name exists in the duplicate directory before trying to move
				mv "$FILE" "$DUPLICATE_DIR"
			else
				# if the file exists, split off the extension so we can rename the file with a -1 at the end
				ORG_NAME="${NAME%.*}" # the name without its last extension
				FILE_EXT="${NAME##*.}" # the extension after the last dot
				NEW_NAME="$ORG_NAME-$EXT_COUNT.$FILE_EXT"
					while [ -e "$DUPLICATE_DIR/$NEW_NAME" ]; do
						EXT_COUNT=$(($EXT_COUNT + 1 ))
						NEW_NAME="$ORG_NAME-$EXT_COUNT.$FILE_EXT"
					done
				mv "$FILE" "$DUPLICATE_DIR/$NEW_NAME"
			fi
	done

##################################################################
# remove empty directories if they exist
EMPTY_DIR=`find $ANSWER -depth -type d -empty`
	for EMPTY in $EMPTY_DIR; do
		DIR_COUNT=$(($DIR_COUNT + 1 ))
		rm -rf $EMPTY
	done

echo "Number of Files Checked: $COUNT"
echo "Number of duplicate files deleted/moved: $F_COUNT"
echo "Number of empty directories deleted: $DIR_COUNT "

##################################################################
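
The duplicate-selection step of the script above can be tried on its own. This is a minimal sketch, assuming GNU sort, uniq and sed and paths without whitespace (the same assumptions the script makes). The function name all_but_last_in_group is made up for illustration. It reads "md5sum path" lines on stdin and prints the paths the script would move: every entry of a duplicate set except the last one in sort order.

```shell
#!/bin/bash
# Sketch only: reproduce the sort | uniq | sed | awk step in isolation.
# Input:  lines of "<32-character md5sum> <path>" on stdin.
# Output: for each checksum that occurs more than once, every path except
#         the last one in sort order (that last one is the copy that stays).
all_but_last_in_group() {
    sort |
    # keep only repeated checksums, one blank line between groups
    uniq -w 32 --all-repeated=separate |
    # hold each line back by one; the held line is dropped at a group's end
    sed -n '/^$/{p;h;};/./{x;/./p;}' |
    # print the path column, skipping the blank separator lines
    awk 'NF {print $2}'
}
```

Feeding it the contents of /tmp/filesfound should list the same files the script moves, without touching anything on disk.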

Edit: Giving credit where credit is due, this is where I found this script, along with a discussion:

http://www.linuxquestions.org/questi...-files-519144/

anon091 · 08-17-2010, 08:57 AM

Thanks for the script. I have a question though: where in the script does it tell it to keep only the most recent version of each duplicate?

kilgoretrout · 08-17-2010, 09:20 AM

It doesn't, as far as I can see. It just compares md5sums of all files and removes all files with identical md5sums except for one of them. Actually, it doesn't delete them; it just moves them to a duplicate image directory that it creates in your home directory. If they have identical md5sums they should be identical files, i.e. the same image, so the time stamp shouldn't matter.
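
If the newest copy is the one you want to keep, the modification time can serve as the tie-breaker within each set of identical checksums. This is a minimal sketch, not a tested replacement for the script. It assumes GNU find, stat, md5sum, sort and mv, file names without newlines, and a holding directory outside the directory being scanned. The function name keep_newest_duplicates and its two arguments are made up for illustration.

```shell
#!/bin/bash
# Sketch only: for each set of files with the same md5sum, keep the one with
# the newest modification time and move the others to a holding directory.
keep_newest_duplicates() {
    local scan_dir=$1 dup_dir=$2
    local file sum mtime path
    mkdir -p "$dup_dir"

    # Emit one line per image: "<md5sum> <mtime in epoch seconds> <path>"
    find "$scan_dir" -type f -iname '*.jpg' -print0 |
    while IFS= read -r -d '' file; do
        sum=$(md5sum < "$file" | awk '{print $1}')
        mtime=$(stat -c '%Y' "$file")
        printf '%s %s %s\n' "$sum" "$mtime" "$file"
    done |
    # Group by checksum, newest first within each group
    sort -k1,1 -k2,2nr |
    # Skip the first (newest) line of each group; print the path of the rest
    awk 'seen[$1]++ { sub(/^[^ ]+ [^ ]+ /, ""); print }' |
    while IFS= read -r path; do
        # --backup=numbered avoids overwriting an earlier file of the same name
        mv --backup=numbered -- "$path" "$dup_dir/"
    done
}
```

For example, `keep_newest_duplicates ~/photos ~/duplicates` would leave the newest copy of each image in ~/photos and move the older copies aside for inspection. When two copies share the same modification time, which one stays depends on sort order.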

Rereading the thread that I linked in the Edit portion of my prior post, I see that fotoguy improved his script from the version I posted. Here's the revised script:

Code:

#!/bin/bash

CWD=`pwd`
FILESFOUND=/tmp/filesfound.txt
FILESSIZE=/tmp/filessize.txt
DUPLICATE_SETS_FOUND=/tmp/duplicate_sets_found.txt
DUPLICATES_TO_DELETE=/tmp/duplicates_to_delete.txt
DUPLICATE_DIR=~/duplicates
COUNT=0
F_COUNT=0
DIR_COUNT=0
EXT_COUNT=1

##################################################################
if [ ! -d $DUPLICATE_DIR ]; then
	mkdir -p $DUPLICATE_DIR
fi

##################################################################
# remove any previous output files
rm -rf $FILESSIZE
rm -rf $FILESFOUND
rm -rf $DUPLICATE_SETS_FOUND
rm -rf $DUPLICATES_TO_DELETE

##################################################################
echo
echo "Duplicate Image Finder"
echo
echo "Press enter for current directory"
echo "Or enter directory path to scan: "
read ANSWER
if [ "$ANSWER" == "" ]; then
	ANSWER="$CWD"
fi

##################################################################
# rename any directory name with spaces with an underscore (deepest first, so renaming a parent doesn't invalidate child paths)
find "$ANSWER" -depth -type d -name '* *' -execdir bash -c 'mv "$1" "${1// /_}"' -- {} \; 2> /dev/null
# rename any file name with spaces with an underscore
find "$ANSWER" -type f -name '* *' -execdir bash -c 'mv "$1" "${1// /_}"' -- {} \; 2> /dev/null

##################################################################
# find images and record each file's size (stat is used instead of parsing ls -l, whose column layout varies between systems)
 for x in `find "$ANSWER" -type f -name "*.[Jj][Pp][Gg]"`; do
	COUNT=$(($COUNT + 1 ))
 	stat -c '%s %n' "$x" >> $FILESFOUND
 	stat -c '%s' "$x" >> $FILESSIZE
done

# if no images files are found just exit script
if [ ! -e $FILESFOUND ] || [ ! -e $FILESSIZE ]; then
	echo "No image files found..........exiting"
	exit
fi

# find duplicate sets and remove one entry so as not to remove the original with subsequent duplicates
# note: files are matched by size alone here, so inspect the moved files before deleting anything
sort $FILESSIZE | uniq -d > $DUPLICATE_SETS_FOUND
 for f in `cat $DUPLICATE_SETS_FOUND`; do
	grep "^$f " "$FILESFOUND" | awk 'a ~ $1; {a=$1}' | awk '{print $2}' >> $DUPLICATES_TO_DELETE
done

#	if no duplicates are found exit script
if [ ! -e $DUPLICATES_TO_DELETE ]; then
	echo "Number of files scanned: $COUNT"
	echo "No duplicate files found"
	exit
fi

# instead of deleting move to the duplicate directory for inspection, just have to delete manually
for FILE in `cat $DUPLICATES_TO_DELETE`; do
	NAME=`basename $FILE`
	F_COUNT=$(($F_COUNT + 1 ))
			if [ ! -e "$DUPLICATE_DIR/$NAME" ]; then # check to see if the file name exists in the duplicate directory before trying to move
				mv "$FILE" "$DUPLICATE_DIR"
			else
				# if the file exists, split off the extension so we can rename the file with a -1 at the end
				ORG_NAME="${NAME%.*}" # the name without its last extension
				FILE_EXT="${NAME##*.}" # the extension after the last dot
				NEW_NAME="$ORG_NAME-$EXT_COUNT.$FILE_EXT"
					while [ -e "$DUPLICATE_DIR/$NEW_NAME" ]; do
						EXT_COUNT=$(($EXT_COUNT + 1 ))
						NEW_NAME="$ORG_NAME-$EXT_COUNT.$FILE_EXT"
					done
				mv "$FILE" "$DUPLICATE_DIR/$NEW_NAME"
			fi
done

##################################################################
# remove empty directories if they exist
 EMPTY_DIR=`find $ANSWER -depth -type d -empty`
 	for EMPTY in $EMPTY_DIR; do
 		DIR_COUNT=$(($DIR_COUNT + 1 ))
 		rm -rf $EMPTY
 	done

echo "Number of Files Checked: $COUNT"
echo "Number of duplicate files deleted/moved: $F_COUNT"
echo "Number of empty directories deleted: $DIR_COUNT "

##################################################################
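
As posted, the revised script appears to treat files of equal size as duplicates, and it renames files to get around spaces in their names. Below is a minimal sketch of a more cautious variant that uses size only as a cheap pre-filter, confirms matches with md5sum, and leaves names untouched. It assumes GNU find, md5sum and uniq, and file names without tabs, newlines or backslashes. The function name list_true_duplicates is made up for illustration.

```shell
#!/bin/bash
# Sketch only: print groups of byte-identical images, one "md5sum  path" line
# per file, with a blank line between groups. Nothing is moved or deleted.
list_true_duplicates() {
    local scan_dir=$1 file
    # "<size><TAB><path>" for every image
    find "$scan_dir" -type f -iname '*.jpg' -printf '%s\t%p\n' |
    # keep only the paths whose size occurs more than once
    awk -F'\t' '
        { count[$1]++; size[NR] = $1; path[NR] = substr($0, length($1) + 2) }
        END { for (i = 1; i <= NR; i++) if (count[size[i]] > 1) print path[i] }
    ' |
    # checksum only those candidates
    while IFS= read -r file; do
        md5sum -- "$file"
    done |
    # report the checksums that occur more than once
    sort | uniq -w 32 --all-repeated=separate
}
```

Running `list_true_duplicates ~/photos` only reports; deciding which copy of each group to move is left to a separate step.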

anon091 · 08-17-2010, 10:13 AM

Thanks