rmlint or any other tool to remove the common part of two file sets
Hello,
I would like to bring order to my 30+ TB disk share ("messup").
There are a lot of duplicates in different locations. The directory trees are extremely deep, as this is about 10 years of archives consisting of recoveries from failed disks, unpacked images, etc.
I have part of the data stored in "messup" also kept, well ordered, on another drive - an external drive that can be treated as reliable/trusted storage - "source".
I have about 50 similar drives, so I am thinking about making this a repeatable process to deduplicate the contents of the "messup" data set.
So I am looking for a tool that can find any duplicates that exist on "source" and also exist somewhere in "messup".
BUT - if the tool finds other duplicates within "messup" only, which do not exist in "source", it should ignore them and not list them as duplicates. (Not in this run; that comes later.)
In short, I would like to find duplicates only in the intersection of both file sets. Is that possible? How can I do it?
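To illustrate what I mean with generic tools (a rough sketch only, far too slow for 30 TB; source.sha, messup.sha and source.hashes are made-up file names):

    find /source -type f -exec sha256sum {} + > source.sha   # hash the trusted set
    find /messup -type f -exec sha256sum {} + > messup.sha   # hash the messy set
    cut -d' ' -f1 source.sha | sort -u > source.hashes       # keep just the hash values
    grep -Ff source.hashes messup.sha                        # messup lines whose hash was seen in source

The last command would print only the "messup" lines whose hash also appears in "source"; duplicates that exist within "messup" only would not match.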
I thought about rmlint, but I haven't found a switch for this that I understand.
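The closest thing I found in the rmlint manual is its "tagged" paths: everything after a // separator is treated as the preferred/original side. If I read it correctly, something like the line below should only report "messup" files that also have a copy under "source", and never touch "source" itself - but I am not sure I understand it correctly, hence this question:

    rmlint /messup // /source --keep-all-tagged --must-match-tagged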
My only idea is to use hashdeep manually, plus some bash loops for the later removal step. That is an extremely manual job in my opinion, but I see it as the only workaround:
Currently I am thinking about scanning everything in "source" with hashdeep, and then rescanning the "messup" set against the known hashes from that "source" scan. I would use -k known_files_from_source_scan.log together with hashdeep's matching mode (-m) when recursively scanning /messup:
Assuming both sets are mounted at /source and /messup, I am thinking about:
1. hashdeep -r /source | tee -a known_files_from_source_scan.log
2. hashdeep -k known_files_from_source_scan.log -m -r /messup | tee -a messup_duplicates_to_remove.log
3. batch-remove the files listed in messup_duplicates_to_remove.log (rough sketch below)
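For step 3, my rough idea is a simple loop (assuming hashdeep's matching mode prints exactly one matched path per line and none of my file names contain newlines - I would review the log by hand before running it):

    while IFS= read -r f; do
        rm -v -- "$f"          # remove one confirmed duplicate from /messup
    done < messup_duplicates_to_remove.log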
And I am looking for any help from you to make that process more automatic, if that is possible...
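To make it repeatable across the ~50 drives, I imagine wrapping steps 1-2 in something like this (a sketch only; the script name and the per-drive log names are my own invention):

    #!/bin/bash
    # dedupe_against_source.sh SOURCE_MOUNT - hypothetical wrapper around steps 1-2
    set -eu
    SOURCE="$1"                                # e.g. /source, a different drive each run
    KNOWN="known_$(basename "$SOURCE").log"    # per-drive hash list
    DUPES="dupes_$(basename "$SOURCE").log"    # per-drive removal candidates
    hashdeep -r "$SOURCE" > "$KNOWN"           # step 1: hash the trusted drive
    hashdeep -k "$KNOWN" -m -r /messup > "$DUPES"   # step 2: match against /messup
    echo "Review $DUPES, then run the removal loop from step 3."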