LinuxQuestions.org
Old 02-16-2018, 04:04 PM   #1
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803

Comparing JPEGs and finding matches... or not finding matches.


Here's one for someone better versed on the innards of JPEG files than I am.

The problem: I have multiple digital cameras and, rather than keep photos from each in separate directory trees, I went about renaming each file using the EXIF exposure date/time information with the original filename tacked on the end. (The theory being that it would be highly unlikely I would have two different cameras having the same 4-digit exposure number in use at the same time while I'm pressing the shutter releases at exactly the same time.) Ahh... if only cameras gave you the opportunity to select something other than "IMG" as a filename prefix.

What I recently found: I have in the past gotten sloppy and offloaded photos from cameras without reformatting the card (or otherwise deleting the photos on the memory cards). Hence, when I offloaded the next bunch of photos into another directory I would, obviously, have two copies of some of the photos in different locations. Plus, a large part of this problem arose as I was running out of disk space so photos were getting offloaded to wherever I had free space until I purchased a pair of giant disks to make a dedicated raidset for photos. I figured that my renaming process would rename the files identically to the names previously generated for the first copies of the files offloaded from the camera---and it does. The plan was to merge all the various directory trees of photos into one directory tree on the big raidset.

Now here's where things get mysterious. If I obtain a checksum for two photos that have the same "new" filename based on the EXIF information, I find that they don't match. I find that in these cases, the photos have the same dimensions but have file sizes that are slightly different, i.e., not really identical. So... I ran `exif' on each photo and found that one of them had slightly different information. One photo had a "Software" item in the EXIF information dump that mentions Gnome F-spot. (Gee, thanks a lot Gnome F-spot... You were supposed to organize the photos; not modify them.) The "Date and Time" field--seen in both photos' EXIF information--also seems to record the date/time that F-spot touched the file as well. The size of that field doesn't correspond to the file size differences, though. And, oddly, the file with the F-spot message is actually the smaller of the two.

Visually the photos appear identical. I've tried comparing them in The GIMP by placing them in layers and subtracting the one layer from the other and looking for differences but, so far, I haven't been able to see any visual differences. (Of course, that might be me not performing the comparison in The GIMP correctly.)
  • Does anyone know what else might have changed beyond the "Date and Time" and "Software" fields in the EXIF information?
  • Since checksums aren't going to be a reliable means of finding identical photos, is there a better way to check if two photos are identical?

I have quite a few photos yet to migrate over to the new disk so if I could script this I'd prefer to deal with the migration that way. There are way too many to be comparing manually.

TIA...

--
Rick
 
Old 02-16-2018, 05:55 PM   #2
norobro
Member
 
Registered: Feb 2006
Distribution: Debian Sid
Posts: 792

This might be a starting point: https://www.imagemagick.org/script/identify.php

I cut a small rectangle out of an image with gimp and saved the changed file.
Code:
$ identify -verbose original.jpg |grep signature
    signature: c9a737ee46ea909f10c446222e641a40aaa7966778a1e297c2f8e6d2f7e125c4

$ identify -verbose modified.jpg |grep signature
    signature: c4d70f22020b609390634f9000f65b764c63e223127274a5d41eae8c1e6b846a
HTH
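
For anyone scripting this, here is a minimal sketch of pulling that signature line out in Python. The helper names `parse_signature` and `image_signature` are made up for illustration, and it assumes ImageMagick's `identify` is installed and on the PATH.

```python
import re
import subprocess

# Matches the "signature: <hex>" line in `identify -verbose` output.
_SIG_RE = re.compile(r"^\s*signature:\s*([0-9a-f]+)\s*$", re.MULTILINE)


def parse_signature(verbose_output):
    """Return the signature hash found in `identify -verbose` text, or None."""
    match = _SIG_RE.search(verbose_output)
    return match.group(1) if match else None


def image_signature(path):
    """Run `identify -verbose` on one file and return its signature.

    Assumes ImageMagick is installed; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        ["identify", "-verbose", path],
        check=True, capture_output=True, text=True,
    )
    return parse_signature(result.stdout)
```

As far as I know, `identify -format '%#' file.jpg` prints the same hash without the rest of the verbose report, which may be worth trying if the full report turns out to be slow.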

Last edited by norobro; 02-16-2018 at 07:42 PM. Reason: typo
 
Old 02-17-2018, 03:12 AM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,131

If the files are different of course the checksum (or signature) will be different.
What about if you remove all the exif data (using exiftool) and compare them? Set up a temp dir and work there on a few copies of files. Not confident if f-spot is changing the files, but you never know.

I too have truckloads of copies of photos, but I've never had an issue of differences. But then I probably haven't used f-spot this century ...
 
Old 02-17-2018, 04:53 AM   #4
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

that's a bummer with the modified exif data, and i don't see any other possibility than discarding the possibly modified fields (sure there's command line software for that), then re-comparing the resulting files.

what i don't understand:
aren't the filenames still the same, although those images reside in different folders?
 
Old 02-17-2018, 08:19 AM   #5
kilgoretrout
Senior Member
 
Registered: Oct 2003
Posts: 2,987

Quote:
Since checksums aren't going to be a reliable means of finding identical photos, is there a better way to check if two photos are identical?
I use Geeqie to do that. Basically, you open the folder containing the files you want examined for duplicates; hit the D key and a popup window comes up entitled "Find duplicates"; in that window, tick the thumbnail box and select "Similarity(high)" in the Compare by box; switch focus back on the Geeqie window and select all the photos with Ctrl-A; drag and drop onto the popup window you just configured and wait.

Last edited by kilgoretrout; 02-17-2018 at 08:25 AM.
 
Old 02-17-2018, 04:01 PM   #6
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803

Original Poster
Quote:
Code:
$ identify -verbose original.jpg |grep signature
    signature: c9a737ee46ea909f10c446222e641a40aaa7966778a1e297c2f8e6d2f7e125c4

$ identify -verbose modified.jpg |grep signature
    signature: c4d70f22020b609390634f9000f65b764c63e223127274a5d41eae8c1e6b846a
HTH
I tried using the ``identify -verbose'' command on both files, saving the output into separate files, and ``diff''ing them. It reveals a lot more differences than the ``exif'' command's output did. Undoubtedly that additional information accounts for the gap between the length of the F-spot version string and the 500-600 byte difference in file sizes. While it's a bit of a disappointment that my existing tool for flagging duplicate files seems to be ineffective when dealing with JPEG files, it shouldn't be a huge effort to make a copy of my tool and adapt it to use ``identify'' as the core of a tool just for images. (It's something to do on a quiet Saturday night! Right?!)

Thanks for reminding me about ``identify''. I tend to forget about the ImageMagick tools.

--
Rick
 
Old 02-17-2018, 07:22 PM   #7
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803

Original Poster
Quote:
Originally Posted by syg00
If the files are different of course the checksum (or signature) will be different.
What about if you remove all the exif data (using exiftool) and compare them? Set up a temp dir and work there on a few copies of files. Not confident if f-spot is changing the files, but you never know.

I too have truckloads of copies of photos, but I've never had an issue of differences. But then I probably haven't used f-spot this century ...
The mismatched checksums of two files that had been identically renamed using their exposure times were the first thing that got me looking into what was going on. Once I saw that the file sizes were different it was obvious why the checksums would be as well. Fortunately, the "signature" that you see with ``identify'' appears to represent only the actual image portion of the JPEG file and, in the case of the two files that brought this to my attention, is identical for both images. The ``Software'' field (containing the F-spot version) that F-spot inserted into the EXIF information was the first change I found that was invalidating my original checksum comparisons.

Anyway, I'm forging ahead using norobro's recommendation of using the information returned by ``identify''. Extracting the ``signature'' field from its report is a simple one-liner. It's pretty much a drop-in replacement for the SHA checksum that my home-brew tool was using for its comparisons, so I avoid a complete re-write.
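
To illustrate the drop-in idea (this is not my actual tool, just a minimal Python sketch), duplicate detection only needs some function that maps a file to a comparison key. The name `group_duplicates` and the `signature_of` parameter are hypothetical; the key could be a SHA checksum or the ``identify'' signature.

```python
from collections import defaultdict


def group_duplicates(paths, signature_of):
    """Group paths that share a signature; return only groups of two or more.

    `signature_of` is any callable mapping a path to a hashable key,
    for example a wrapper around `identify` (hypothetical here).
    """
    groups = defaultdict(list)
    for path in paths:
        groups[signature_of(path)].append(path)
    # Singletons are not duplicates, so drop them.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Swapping the checksum for the image signature then means changing only the callable that is passed in, not the grouping logic.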

I was using another utility for image file management (name of which escapes me at the moment... doesn't matter anyway) and I only used F-spot for a brief time after an "upgrade" made F-spot the default image management tool---and the former software I'd been using seemed to disappear altogether. I don't use any of those types of tools any more so the problem I'm seeing will only involve the photos that were touched by F-spot---a subset of the photos on my system.
 
Old 02-17-2018, 07:23 PM   #8
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803

Original Poster
Quote:
Originally Posted by kilgoretrout
I use Geeqie to do that. Basically, you open the folder containing the files you want examined for duplicates; hit the D key and a popup window comes up entitled "Find duplicates"; in that window, tick the thumbnail box and select "Similarity(high)" in the Compare by box; switch focus back on the Geeqie window and select all the photos with Ctrl-A; drag and drop onto the popup window you just configured and wait.
Sounds interesting. I'll have to give that utility a try.
 
Old 02-17-2018, 07:44 PM   #9
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803

Original Poster
Quote:
Originally Posted by ondoho
that's a bummer with the modified exif data, and i don't see any other possibility than discarding the possibly modified fields (sure there's command line software for that), then re-comparing the resulting files.

what i don't understand:
aren't the filenames still the same, although those images reside in different folders?
I'm not keen on editing the files to remove EXIF information in any of the images. If I find duplicates based on the signatures that are reported by ``identify'' I'll use the existence of the "Software: f-spot..." information to tell me which file I want to overwrite.
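
Roughly, that decision rule could look like the following Python sketch. The function name `split_by_fspot` is made up, and how the Software tag gets read from each file is deliberately left open.

```python
def split_by_fspot(software_by_path):
    """Separate untouched files from ones whose Software tag names F-Spot.

    `software_by_path` maps each path to its EXIF Software string, or to
    None when the tag is absent (reading the tag is not shown here).
    """
    untouched, touched = [], []
    for path, software in sorted(software_by_path.items()):
        if software and "f-spot" in software.lower():
            touched.append(path)   # candidate to be overwritten
        else:
            untouched.append(path)  # candidate to keep
    return untouched, touched
```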

Re: filenames and folders... Yes. Same filenames derived from exposure date/time. They're currently in different directories but the goal is to merge the two trees into one. (There are multiple trees but I'm only working with two at a time: the master tree on the big raidset and ones that are scattered around in various locations due to space limits at the time, moving home directories from disk to disk, etc. Yeah. It's messy. Amazing how messy things can get. One former home directory dates back to the early '00s.) The target tree structure is, in general, "~/photos/YYYY/MM/DD". My merge process was taking the filenames, for example "YYYYMMDDhhmmsstt_img_NNNN.jpg", extracting the date portion and moving the file into the correct dated subdirectory. It was at that point I started getting "Hey! File already exists!" messages which had me wondering about what was going on, as my original file comparisons based on SHA checksums indicated that no duplicates existed in the two trees.
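
As a sketch of that date-extraction step (Python here purely for illustration, not my real merge script), the function below turns a renamed file into its dated subdirectory. The name `dated_subdir` and the default `root` are hypothetical, and the pattern assumes "hhmmsstt" is always eight digits.

```python
import re
from pathlib import PurePosixPath

# YYYYMMDDhhmmsstt_img_NNNN.jpg: only the leading date digits are captured.
_NAME_RE = re.compile(r"^(\d{4})(\d{2})(\d{2})\d{8}_")


def dated_subdir(filename, root="photos"):
    """Return root/YYYY/MM/DD for a renamed photo, or None if no match."""
    match = _NAME_RE.match(PurePosixPath(filename).name)
    if match is None:
        return None
    year, month, day = match.groups()
    return PurePosixPath(root) / year / month / day
```

Checking whether the destination file already exists before moving, and comparing signatures when it does, is what turns the "File already exists!" surprise into a deliberate duplicate check.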


Hope that made some sense.

Later...

--
Rick
 
Old 02-17-2018, 07:52 PM   #10
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,131

Quote:
Originally Posted by rnturn
I'm not keen on editing the files to remove EXIF information in any of the images.
That's why I suggested a few test files - just to see if a checksum now works. Easy enough to script copy-and-strip.
And yes I know it'll thrash your hardware. In my case my best system is only for my photos (and driving sims ... ), so hammering the disks is of no consequence.
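
A rough sketch of the copy-and-strip test in Python, so the originals are never touched. The function names are made up; it assumes `exiftool` is on the PATH and uses its `-all=` and `-overwrite_original` options on the temporary copy only.

```python
import hashlib
import shutil
import subprocess
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stripped_checksum(path):
    """Checksum a metadata-free temporary copy of `path`.

    The original file is only read; exiftool rewrites the copy.
    """
    with tempfile.TemporaryDirectory() as tmp:
        copy = Path(tmp) / Path(path).name
        shutil.copy2(path, copy)
        subprocess.run(
            ["exiftool", "-all=", "-overwrite_original", str(copy)],
            check=True, capture_output=True,
        )
        return sha256_of(copy)
```

Whether two stripped copies actually come out byte-identical is exactly what the test files would show; nothing here guarantees it.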
 
Old 02-17-2018, 08:29 PM   #11
BW-userx
LQ Guru
 
Registered: Sep 2013
Location: Somewhere in my head.
Distribution: Slackware (15 current), Slack15, Ubuntu studio, MX Linux, FreeBSD 13.1, WIn10
Posts: 10,342

Quote:
Originally Posted by rnturn
I'm not keen on editing the files to remove EXIF information in any of the images. If I find duplicates based on the signatures that are reported by ``identify'' I'll use the existence of the "Software: f-spot..." information to tell me which file I want to overwrite.

Re: filenames and folders... Yes. Same filenames derived from exposure date/time. They're currently in different directories but the goal is to merge the two trees into one. (There are multiple trees but I'm only working with two at a time: the master tree on the big raidset and ones that are scattered around in various locations due to space limits at the time, moving home directories from disk to disk, etc. Yeah. It's messy. Amazing how messy things can get. One former home directory dates back to the early '00s.) The target tree structure is, in general, "~/photos/YYYY/MM/DD". My merge process was taking the filenames, for example "YYYYMMDDhhmmsstt_img_NNNN.jpg", extracting the date portion and moving the file into the correct dated subdirectory. It was at that point I started getting "Hey! File already exists!" messages which had me wondering about what was going on, as my original file comparisons based on SHA checksums indicated that no duplicates existed in the two trees.


Hope that made some sense.

Later...

--
Rick
getting info off exiftool : Example on how to figure out what words to use.

#just run an image using exiftool to see what fields you have to work with.
Code:
bash-4.3$ exiftool /media/data/wallpaper/selena-gomez/Selena_Gomez_Fetish-0898.png
ExifTool Version Number         : 10.55
File Name                       : Selena_Gomez_Fetish-0898.png
Directory                       : /media/data/wallpaper/selena-gomez
File Size                       : 485 kB
File Modification Date/Time     : 2017:10:25 11:53:10-05:00
File Access Date/Time           : 2018:02:17 20:19:21-06:00
File Inode Change Date/Time     : 2018:02:16 11:55:41-06:00
File Permissions                : rwxr-xr-x
File Type                       : PNG
File Type Extension             : png
MIME Type                       : image/png
Image Width                     : 960
Image Height                    : 720
Bit Depth                       : 8
Color Type                      : RGB with Alpha
Now use the words on the left side, and if there are two or more of them just put them together in this fashion.

Code:
# to see just one particular field on the cli
exiftool -p '$BitDepth' path/filename
# to put it into a variable
pixInfo=$(exiftool -p '$ImageHeight' path/filename)
# to see what it is
echo "$pixInfo"
You could put it in a script in a loop, or run it off the cli, if you want to process one or more files at a time and not have to type in every path/filename one at a time.
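
If you'd rather script it outside the shell, here is a small Python sketch that splits the default exiftool report into name/value pairs. `parse_exiftool_report` is a made-up name, and it assumes the "Name : value" layout shown in the output above.

```python
def parse_exiftool_report(text):
    """Turn 'Description   : value' lines into a dict keyed by description."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ": " so values containing colons stay whole.
        name, sep, value = line.partition(": ")
        if sep:  # skip lines that do not have the "name : value" shape
            fields[name.strip()] = value.strip()
    return fields
```

The keys here are the descriptive names from the left column, not the tag names used with `-p`. exiftool's `-j` option prints JSON instead, which is probably the more robust thing to parse.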

for date time stuff
Code:
bash-4.3$ exiftool -p '$FileModifyDate' "/media/data/wallpaper/selena-gomez/Selena_Gomez_Fetish-0898.png"
2017:10:25 11:53:10-05:00
oh I just found this, you might want to read it pertaining to date and time that shows up on the image info
https://ninedegreesbelow.com/photogr...-commands.html

I found another way to get field names to use. run this on an image and it prints out the field format
Code:
exiftool -s -G    filename.jpg

Last edited by BW-userx; 02-17-2018 at 08:56 PM.
 
Old 02-17-2018, 09:47 PM   #12
Sefyir
Member
 
Registered: Mar 2015
Distribution: Linux Mint
Posts: 634

I stole a script from here, based on a Stack Overflow question, made it work in python3, and used an older trick of mine to compare each image against every other and group the duplicates (ignoring files already considered duplicates), instead of just comparing img1 and img2.
This compares images based on their greyscale value of how "similar" they are.

You'll need python3 and the scipy module to make it work (you can download anaconda to make this easier to install)
The idea is to lower the threshold as low as possible to match duplicate images.
The images have to be the same size for comparison, however.

No promises, but it was an interesting script to hack on.

Script: https://gist.github.com/anonymous/a7...93d341897fde64

Code:
usage: compare.py [-h] [--manhattan MANHATTAN] [--zero ZERO]
                  [--absolute-paths]
                  images [images ...]

positional arguments:
  images

optional arguments:
  -h, --help            show this help message and exit
  --manhattan MANHATTAN
                        Set threshold
  --zero ZERO           Set threshold
  --absolute-paths
Manhattan and Zero?
Quote:
Consider using Manhattan norm (the sum of the absolute values) or zero norm (the number of elements not equal to zero) to measure how much the image has changed. The former will tell you how much the image is off, the latter will tell only how many pixels differ.
Example usage:

Code:
./compare.py --manhattan 40 $(find images -iname \*png)
images/sedona_right_01.png <-> images/sedona_right_01.png
  Manhattan:0.0,
  Zero:0.0
images/sedona_right_01.png <-> images/sedona_left_01.png
  Manhattan:38.94154426511395,
  Zero:1.0
images/sedona_right_01.png <-> images/scottsdale_right_01.png
  Manhattan:68.13498262862784,
  Zero:1.0
...
...
images/sedona_right_01.png <-> images/scottsdale_left_01.png
  Manhattan:74.20768358134397,
  Zero:1.0
images/sedona_right_01.png,images/sedona_left_01.png
images/grand_canyon_right_02.png,images/grand_canyon_left_02.png
images/bryce_right_03.png,images/bryce_right_01.png
images/bryce_right_02.png,images/bryce_left_02.png
images/bryce_left_03.png,images/bryce_left_01.png
You can filter out the information about each image comparison (<->) by discarding stderr (2> /dev/null)
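
To make the two norms quoted above concrete, here is a pure-Python sketch on greyscale images stored as lists of rows. It is not the code from the linked script: the function names are my own, and these return raw totals, whereas the values printed above look like they are normalised per pixel.

```python
def _check_same_shape(a, b):
    """Raise ValueError unless both images have identical dimensions."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")


def manhattan_norm(a, b):
    """Sum of absolute per-pixel differences: how far off the image is."""
    _check_same_shape(a, b)
    return sum(abs(x - y)
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def zero_norm(a, b):
    """Number of pixel positions at which the two images differ."""
    _check_same_shape(a, b)
    return sum(1
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b) if x != y)
```

Two identical images give 0 for both norms, which is the case that matters when hunting for exact duplicates.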

Last edited by Sefyir; 02-17-2018 at 09:52 PM.
 
Old 02-17-2018, 10:45 PM   #13
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803

Original Poster
Quote:
Originally Posted by BW-userx
getting info off exiftool : Example on how to figure out what words to use.

[snip]

I found another way to get field names to use. run this on an image and it prints out the field format
Code:
exiftool -s -G    filename.jpg
I modified my Perl script to extract that signature field from the `identify -verbose' report and, other than being slower than I remember the original version of the script running, it's detecting duplicates just fine now.

I might try fiddling with `exiftool' parameters to see if I can get a performance boost--I've seen complaints out on the 'net about `identify -verbose' being slow--but for now things are working. The nice thing about identify's signature, though, is that it works for other image file formats so I could extend the file extensions I have the script looking at to include things like TIFF, PNG, etc.
 
Old 02-17-2018, 10:55 PM   #14
BW-userx
LQ Guru
 
Registered: Sep 2013
Location: Somewhere in my head.
Distribution: Slackware (15 current), Slack15, Ubuntu studio, MX Linux, FreeBSD 13.1, WIn10
Posts: 10,342

Quote:
Originally Posted by rnturn
I modified my Perl script to extract that signature field from the `identify -verbose' report and, other than being slower than I remember the original version of the script running, it's detecting duplicates just fine now.

I might try fiddling with `exiftool' parameters to see if I can get a performance boost--I've seen complaints out on the 'net about `identify -verbose' being slow--but for now things are working. The nice thing about identify's signature, though, is that it works for other image file formats so I could extend the file extensions I have the script looking at to include things like TIFF, PNG, etc.
exiftool is written in Perl:
http://search.cpan.org/~exiftool/Ima...e/ExifTool.pod

I use it mostly for getting info off mp3 in bash scripts. but in that link it shows extensions, png, jpg etc..
old way
Code:
  SongName="`exiftool  -Title  "$FILENAME" -p '$Title'`"
different way, or new way I am doing it.
Code:
pixInfo=$(exiftool -p '$ImageHeight' path/filename)
that format I posted I have found to be faster than the old way I was doing it.

But being perl-exiftool inside of a perl script, I'd think they would complement each other.

file types:
http://search.cpan.org/~exiftool/Ima...l/TagNames.pod

Last edited by BW-userx; 02-17-2018 at 11:26 PM.
 
Old 02-17-2018, 11:10 PM   #15
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803

Original Poster
Quote:
Originally Posted by Sefyir
I stole a script from here, based on a Stack Overflow question, made it work in python3 and used a older trick of mine to compare each image to each other and grouping the duplicates (ignoring files already considered duplicates).. instead of just comparing img1 and img2.
Not exactly what I'm trying to accomplish with my current project (unless I use it with the idea that "similar" means 100% match) but this might come in handy for some other image work I was thinking of doing. That work might entail dealing with "similar" images but of different sizes so I'd have to figure out how that code might have to be tweaked to convert images to the same size in memory (using some interpolation technique) before doing any looking for similarity. (Man... this takes me back to my image processing days many, many moons ago.)
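
For the record, the simplest interpolation I could start from is nearest-neighbour resampling. A minimal pure-Python sketch on a greyscale image stored as a list of rows follows; `resize_nearest` is a made-up name, and a real tool would more likely use a library resampler with bilinear or better filtering.

```python
def resize_nearest(pixels, new_width, new_height):
    """Resample a list-of-rows greyscale image with nearest-neighbour lookup."""
    old_height = len(pixels)
    old_width = len(pixels[0])
    return [
        # Map each target coordinate back to the source pixel it falls in.
        [pixels[y * old_height // new_height][x * old_width // new_width]
         for x in range(new_width)]
        for y in range(new_height)
    ]
```

Bringing both images to a common size this way would let a same-size comparison run on them, at the cost of some blockiness.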
 
  


All times are GMT -5.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration