rmlint or any other tool to remove the common part of two file sets
Hello,
I would like to bring order to my 30+ TB disk share ("messup").
There are a lot of duplicates in different locations. The directory trees are extremely deep, as this is about 10 years of archives consisting of recoveries from failed disks, unpacked images, etc.
I have part of the data stored in "messup" also kept, well ordered, on another drive - an external drive that can be treated as reliable/trusted storage - "source".
I have about 50 similar drives, so I am thinking about making this a repeatable process to deduplicate the contents of the "messup" data set.
So I am looking for a tool that can find any duplicates that exist on "source" and also exist somewhere in "messup".
BUT - if the tool finds other duplicates within "messup" only, which do not exist in "source", it should ignore them and not list them as duplicates. (Not in this run; that comes later.)
In short, I would like to find duplicates only in the intersection of both file sets. Is that possible? How can I do it?
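To illustrate what I mean with generic tools (a rough sketch only, far too slow for 30 TB; source.sha, messup.sha and source.hashes are made-up file names):

    find /source -type f -exec sha256sum {} + > source.sha   # hash the trusted set
    find /messup -type f -exec sha256sum {} + > messup.sha   # hash the messy set
    cut -d' ' -f1 source.sha | sort -u > source.hashes       # keep just the hash values
    grep -Ff source.hashes messup.sha                        # messup lines whose hash was seen in source

The last command would print only the "messup" lines whose hash also appears in "source"; duplicates that exist within "messup" only would not match.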
I thought about rmlint, but I haven't found a switch for this that I understand.
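The closest thing I found in the rmlint manual is its "tagged" paths: everything after a // separator is treated as the preferred/original side. If I read it correctly, something like the line below should only report "messup" files that also have a copy under "source", and never touch "source" itself - but I am not sure I understand it correctly, hence this question:

    rmlint /messup // /source --keep-all-tagged --must-match-tagged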
My only idea is to use hashdeep manually, plus some bash loops for the later removal step. That is an extremely manual job in my opinion, but I see it as the only workaround:
Currently I am thinking about scanning everything in "source" with hashdeep, and then rescanning the "messup" set against the known hashes from that "source" scan. I would use -k known_files_from_source_scan.log together with hashdeep's matching mode (-m) when recursively scanning /messup:
Assuming both sets are mounted at /source and /messup, I am thinking about:
1. hashdeep -r /source | tee -a known_files_from_source_scan.log
2. hashdeep -k known_files_from_source_scan.log -m -r /messup | tee -a messup_duplicates_to_remove.log
3. batch-remove the files listed in messup_duplicates_to_remove.log (rough sketch below)
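For step 3, my rough idea is a simple loop (assuming hashdeep's matching mode prints exactly one matched path per line and none of my file names contain newlines - I would review the log by hand before running it):

    while IFS= read -r f; do
        rm -v -- "$f"          # remove one confirmed duplicate from /messup
    done < messup_duplicates_to_remove.log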
And I am looking for any help from you to make that process more automatic, if that is possible...
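To make it repeatable across the ~50 drives, I imagine wrapping steps 1-2 in something like this (a sketch only; the script name and the per-drive log names are my own invention):

    #!/bin/bash
    # dedupe_against_source.sh SOURCE_MOUNT - hypothetical wrapper around steps 1-2
    set -eu
    SOURCE="$1"                                # e.g. /source, a different drive each run
    KNOWN="known_$(basename "$SOURCE").log"    # per-drive hash list
    DUPES="dupes_$(basename "$SOURCE").log"    # per-drive removal candidates
    hashdeep -r "$SOURCE" > "$KNOWN"           # step 1: hash the trusted drive
    hashdeep -k "$KNOWN" -m -r /messup > "$DUPES"   # step 2: match against /messup
    echo "Review $DUPES, then run the removal loop from step 3."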