Comparing JPEGs and finding matches... or not finding matches.
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803
Rep:
Here's one for someone better versed in the innards of JPEG files than I am.
The problem: I have multiple digital cameras and, rather than keep photos from each in separate directory trees, went about renaming each file using the EXIF exposure date/time information with the original filename tacked on the end. (The theory being that it would be highly unlikely I would have two different cameras having the same 4-digit exposure number in use at the same time while I'm pressing the shutter releases at exactly the same time.) Ahh... if only cameras gave you the opportunity to select something other than "IMG" as a filename prefix.
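For anyone wanting to do a similar rename, exiftool can build the new name from the EXIF timestamp directly. This is only a sketch of the idea, not the OP's actual script; it assumes exiftool is installed, and the wrapper name `rename_by_exif` is mine:

```shell
# Rename every image in a directory to <EXIF timestamp>_<original name>.
# -d sets the format used when ${DateTimeOriginal} is expanded; %f and %e
# are exiftool's escapes for the original basename and extension.
rename_by_exif() {
    exiftool -d '%Y%m%d%H%M%S' '-FileName<${DateTimeOriginal}_%f.%e' "$1"
}
# usage (try it on a scratch copy of an offload directory first):
#   rename_by_exif ~/offload/2018-02-17
```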
What I recently found: I have in the past gotten sloppy and offloaded photos from cameras without reformatting the card (or otherwise deleting the photos on the memory cards). Hence, when I offloaded the next bunch of photos into another directory I would, obviously, have two copies of some of the photos in different locations. Plus, a large part of this problem arose as I was running out of disk space so photos were getting offloaded to wherever I had free space until I purchased a pair of giant disks to make a dedicated raidset for photos. I figured that my renaming process would rename the files identically to the names previously generated for the first copies of the files offloaded from the camera---and it does. The plan was to merge all the various directory trees of photos into one directory tree on the big raidset.
Now here's where things get mysterious. If I obtain a checksum for two photos that have the same "new" filename based on the EXIF information, I find that they don't match. I find that in these cases, the photos have the same dimensions but have file sizes that are slightly different, i.e., not really identical. So... I ran `exif' on each photo and found that one of them had slightly different information. One photo had a "Software" item in the EXIF information dump that mentions Gnome F-spot. (Gee, thanks a lot Gnome F-spot... You were supposed to organize the photos; not modify them.) The "Date and Time" field--seen in both photos' EXIF information--also seems to record the date/time that F-spot touched the file as well. The size of that field doesn't correspond to the file size differences, though. And, oddly, the file with the F-spot message is actually the smaller of the two.
Visually the photos appear identical. I've tried comparing them in The GIMP by placing them in layers and subtracting the one layer from the other and looking for differences but, so far, I haven't been able to see any visual differences. (Of course, that might be me not performing the comparison in The GIMP correctly.)
Does anyone know what else might have changed beyond the "Date and Time" and "Software" fields in the EXIF information?
Since checksums aren't going to be a reliable means of finding identical photos, is there a better way to check if two photos are identical?
I have quite a few photos yet to migrate over to the new disk so if I could script this I'd prefer to deal with the migration that way. There are way too many to be comparing manually.
If the files are different of course the checksum (or signature) will be different.
What about if you remove all the EXIF data (using exiftool) and compare them? Set up a temp dir and work there on a few copies of the files. I'm not certain f-spot is changing the files, but you never know.
I too have truckloads of copies of photos, but I've never had an issue of differences. But then I probably haven't used f-spot this century ...
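A sketch of that test, assuming exiftool is installed (the helper name `strip_and_sum` is mine): copy two suspect files to a scratch directory, strip every tag, then checksum what's left.

```shell
# Strip all metadata from copies of two files and compare checksums of the
# remainders. Note: rename one copy first if both files share a basename.
# exiftool -all= leaves *.jpg_original backups behind, so the final glob
# only picks up the stripped copies.
strip_and_sum() {
    local tmp
    tmp=$(mktemp -d)
    cp -- "$1" "$2" "$tmp"/
    exiftool -all= "$tmp"/*.jpg >/dev/null
    md5sum "$tmp"/*.jpg
}
```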
that's a bummer with the modified exif data, and i don't see any other possibility than discarding the possibly modified fields (surely there's command line software for that), then re-comparing the resulting files.
what i don't understand:
aren't the filenames still the same, although those images reside in different folders?
Since checksums aren't going to be a reliable means of finding identical photos, is there a better way to check if two photos are identical?
I use Geeqie to do that. Basically, you open the folder containing the files you want examined for duplicates; hit the D key and a popup window comes up entitled "Find duplicates"; in that window, tick the thumbnail box and select "Similarity(high)" in the Compare by box; switch focus back on the Geeqie window and select all the photos with Ctrl-A; drag and drop onto the popup window you just configured and wait.
Last edited by kilgoretrout; 02-17-2018 at 08:25 AM.
I tried using the ``identify -verbose'' command on both files, saving the output into separate files, and ``diff''ing them. It reveals a lot more difference than the ``exif'' command's output did. Undoubtedly that additional information makes up the difference between the length of the F-spot version string and the 500-600 byte difference in file sizes. While it's a bit of a disappointment that my existing tool for flagging duplicate files seems to be ineffective when dealing with JPEG files, it shouldn't be a huge effort to make a copy of my tool and adapt it to use ``identify'' as the core of a tool just for images. It's something to do on a quiet Saturday night! Right?!
Thanks for reminding me about ``identify''. I tend to forget about the ImageMagick tools.
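For bulk duplicate hunting, identify can print just the signature with its %# format escape, which skips the rest of the slow -verbose report. A sketch (the function name is mine; on my ImageMagick the signature is 64 hex characters, which is what the uniq width assumes):

```shell
# Print "signature path" for each image, sort, and show only entries whose
# pixel-data signature repeats -- EXIF differences don't affect %#.
find_dupes() {
    identify -format '%# %i\n' "$@" | sort | uniq -w 64 --all-repeated=separate
}
# usage: find_dupes ~/photos/2018/02/17/*.jpg
```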
Quote:
Originally Posted by syg00
If the files are different of course the checksum (or signature) will be different.
What about if you remove all the EXIF data (using exiftool) and compare them? Set up a temp dir and work there on a few copies of the files. I'm not certain f-spot is changing the files, but you never know.
I too have truckloads of copies of photos, but I've never had an issue of differences. But then I probably haven't used f-spot this century ...
The mismatched checksums of two files that had been identically renamed using their exposure times was the first thing that got me looking into what was going on. Once I saw that the file sizes were different it was obvious why the checksums would be as well. Fortunately, the "signature" that you see with ``identify'' only appears to represent the actual image portion of the JPEG file and, in the case of the two files that brought this to my attention, is identical for both images. It's F-spot's insertion of a ``Software'' field (containing the F-spot version) into the EXIF information that was the first thing I found invalidating my original checksum comparisons.
Anyway, I'm forging ahead using norobro's recommendation of using the information returned by ``identify''. Extracting the ``signature'' field from its report is a simple one-liner. It's pretty much a drop-in replacement for the SHA checksum my home-brew tool was previously using for comparisons, so I avoid a complete re-write.
I was using another utility for image file management (name of which escapes me at the moment... doesn't matter anyway) and I only used F-spot for a brief time after an "upgrade" made F-spot the default image management tool---and the former software I'd been using seemed to disappear altogether. I don't use any of those types of tools any more, so the problem I'm seeing will only involve the photos that were touched by F-spot---a subset of the photos on my system.
Quote:
Originally Posted by kilgoretrout
I use Geeqie to do that. Basically, you open the folder containing the files you want examined for duplicates; hit the D key and a popup window comes up entitled "Find duplicates"; in that window, tick the thumbnail box and select "Similarity(high)" in the Compare by box; switch focus back on the Geeqie window and select all the photos with Ctrl-A; drag and drop onto the popup window you just configured and wait.
Sounds interesting. I'll have to give that utility a try.
Quote:
Originally Posted by ondoho
that's a bummer with the modified exif data, and i don't see any other possibility than discarding the possibly modified fields (surely there's command line software for that), then re-comparing the resulting files.
what i don't understand:
aren't the filenames still the same, although those images reside in different folders?
I'm not keen on editing the files to remove EXIF information in any of the images. If I find duplicates based on the signatures that are reported by ``identify'' I'll use the existence of the "Software: f-spot..." information to tell me which file I want to overwrite.
Re: filenames and folders... Yes. Same filenames derived from exposure date/time. They're currently in different directories but the goal is to merge the two trees into one. (There are multiple trees but I'm only working with two at a time: the master tree on the big raidset and the ones that are scattered around in various locations due to space limits at the time, moving home directories from disk to disk, etc. Yeah. It's messy. Amazing how messy things can get. One former home directory dates back to the early '00s.) The target tree structure is, in general, "~/photos/YYYY/MM/DD". My merge process was taking the filenames, for example "YYYYMMDDhhmmsstt_img_NNNN.jpg", extracting the date portion and moving the file into the correct dated subdirectory. It was at that point I started getting "Hey! File already exists!" messages, which had me wondering what was going on, as my original file comparisons based on SHA checksums indicated that no duplicates existed in the two trees.
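That merge step can be sketched in shell. The date math is pure string slicing on the YYYYMMDD... prefix, and mv -n refuses to overwrite, so a pre-existing file produces exactly the "already exists" situation described above (the function names are mine, not from the OP's script):

```shell
# Derive ~/photos/YYYY/MM/DD from a filename that starts with YYYYMMDDhhmmss...
dest_for() {
    local name=${1##*/}                      # strip any leading path
    printf '%s/%s/%s/%s\n' "$HOME/photos" \
        "${name:0:4}" "${name:4:2}" "${name:6:2}"
}

# Move every jpg in one source directory into its dated destination;
# mv -n leaves duplicates in place instead of overwriting them.
merge_tree() {
    local f dest
    for f in "$1"/*.jpg; do
        dest=$(dest_for "$f")
        mkdir -p "$dest"
        mv -n -- "$f" "$dest"/
    done
}
```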
I'm not keen on editing the files to remove EXIF information in any of the images.
That's why I suggested a few test files - just to see if a checksum now works. Easy enough to script copy-and-strip.
And yes, I know it'll thrash your hardware. In my case my best system is only for my photos (and driving sims ... ), so hammering the disks is of no consequence.
Hope that made some sense.
Later...
--
Rick
Getting info out of exiftool: an example of how to figure out what tag names to use. Just run exiftool on an image to see what fields you have to work with.
Code:
bash-4.3$ exiftool /media/data/wallpaper/selena-gomez/Selena_Gomez_Fetish-0898.png
ExifTool Version Number : 10.55
File Name : Selena_Gomez_Fetish-0898.png
Directory : /media/data/wallpaper/selena-gomez
File Size : 485 kB
File Modification Date/Time : 2017:10:25 11:53:10-05:00
File Access Date/Time : 2018:02:17 20:19:21-06:00
File Inode Change Date/Time : 2018:02:16 11:55:41-06:00
File Permissions : rwxr-xr-x
File Type : PNG
File Type Extension : png
MIME Type : image/png
Image Width : 960
Image Height : 720
Bit Depth : 8
Color Type : RGB with Alpha
Now use the words on the left side and just put them together (no spaces) if there are two or more of them, in this fashion:
Code:
# to see just one particular field on the CLI:
exiftool -p '$BitDepth' path/filename

# to put it into a variable, then echo it to see what it holds:
pixInfo=$(exiftool -p '$ImageHeight' path/filename)
echo "$pixInfo"
You could put it in a script in a loop, or run it off the CLI, if you want to process one or more files without having to type in every path/filename one at a time.
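For example, a minimal loop over a list of files (assumes exiftool is installed; the function name is mine):

```shell
# Print "filename: WIDTHxHEIGHT" for each image named on the command line.
dims() {
    local f
    for f in "$@"; do
        printf '%s: %s\n' "$f" "$(exiftool -p '${ImageWidth}x${ImageHeight}' "$f")"
    done
}
# usage: dims /media/data/wallpaper/selena-gomez/*.png
```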
I stole a script from here, based on a Stack Overflow question, made it work in python3, and used an older trick of mine to compare each image to every other and group the duplicates (ignoring files already considered duplicates), instead of just comparing img1 and img2.
This compares images based on their greyscale value of how "similar" they are.
You'll need python3 and the scipy module to make it work (you can download anaconda to make this easier to install)
The idea is to lower the threshold as low as possible to match duplicate images.
The images have to be the same size for comparison, however.
No promises, but it was an interesting script to hack on.
usage: compare.py [-h] [--manhattan MANHATTAN] [--zero ZERO]
[--absolute-paths]
images [images ...]
positional arguments:
images
optional arguments:
-h, --help show this help message and exit
--manhattan MANHATTAN
Set threshold
--zero ZERO Set threshold
--absolute-paths
Consider using Manhattan norm (the sum of the absolute values) or zero norm (the number of elements not equal to zero) to measure how much the image has changed. The former will tell you how much the image is off, the latter will tell only how many pixels differ.
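ImageMagick can produce both kinds of number from the command line too, if you'd rather not run the Python script: the AE metric counts differing pixels (the zero norm) and MAE is the mean absolute per-pixel difference (the Manhattan norm divided by pixel count). A sketch, assuming ImageMagick is installed (the wrapper name is mine):

```shell
# Print the number of differing pixels between two same-sized images.
# compare writes the metric to stderr and exits nonzero when the images
# differ, hence the redirect and the trailing "|| true".
pixel_diff() {
    compare -metric AE "$1" "$2" null: 2>&1 || true
    echo    # compare's metric output has no trailing newline
}
```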
Quote:
Originally Posted by BW-userx
getting info off exiftool : Example on how to figure out what words to use.
[snip]
I found another way to get field names to use. Run this on an image and it prints out the field names in that format:
Code:
exiftool -s -G filename.jpg
I modified my Perl script to extract that signature field from the `identify -verbose' report and, other than being slower than I remember the original version of the script running, it's detecting duplicates just fine now.
I might try fiddling with ``exiftool'' parameters to see if I can get a performance boost--I've seen complaints out on the 'net about ``identify -verbose'' being slow--but for now things are working. The nice thing about identify's signature, though, is that it works for other image file formats, so I could extend the file extensions I have the script looking at to include things like TIFF, PNG, etc.
Quote:
Originally Posted by Sefyir
I stole a script from here, based on a Stack Overflow question, made it work in python3, and used an older trick of mine to compare each image to every other and group the duplicates (ignoring files already considered duplicates), instead of just comparing img1 and img2.
Not exactly what I'm trying to accomplish with my current project (unless I use it with the idea that "similar" means 100% match) but this might come in handy for some other image work I was thinking of doing. That work might entail dealing with "similar" images but of different sizes so I'd have to figure out how that code might have to be tweaked to convert images to the same size in memory (using some interpolation technique) before doing any looking for similarity. (Man... this takes me back to my image processing days many, many moons ago.)