02-20-2020, 01:51 PM   #1
uyjjhak
LQ Newbie
Registered: Feb 2020
Posts: 11
rmlint or any other tool to remove the common part of two file systems


Hello,
I would like to bring some order to my 30+ TB disk share ("messup").
There are a lot of duplicates in different locations, and the directory trees are extremely deep: it is roughly ten years of archives consisting of recoveries from failed disks, unpacked images, and so on.

Part of the data in "messup" is also stored, well organised, on another drive, e.g. an external drive that can be treated as reliable/trusted storage: "source".

I have about 50 similar drives, so I want to turn this into a repeatable process that yields a deduplicated version of the "messup" data set.

So I am looking for a tool that can find duplicates which exist in "source" and also exist in any other location in "messup".

BUT if the tool finds duplicates that exist only within "messup" and not in "source", it should ignore them and not list them as duplicates. (Not in this run; later.)

In short, I would like to find duplicates only in the intersection of both file sets. Is that possible? How can I do it?

I thought about rmlint, but I could not find a switch for this that I understood.


The only idea I have is to use hashdeep manually, plus some bash loops for the later removal step. In my opinion that is an extremely manual job, but I see it as the only workaround:

Currently I am thinking about scanning everything in "source" with hashdeep, and then rescanning the "messup" set with hashdeep using the known hashes from the previous "source" scan. I would use -k known_files_from_source_scan.log and hashdeep's matching mode while recursively scanning "/messup".
Assuming both sets are mounted at /messup and /source, I am thinking about:
1. hashdeep -r /source | tee -a known_files_from_source_scan.log
2. hashdeep -k known_files_from_source_scan.log -m -r /messup | tee -a messup_duplicates_to_remove.log
3. batch-remove the files listed in messup_duplicates_to_remove.log (a sketch of this step is below)
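
A minimal sketch of step 3, assuming hashdeep's matching mode (-m) writes one matched path per line to messup_duplicates_to_remove.log (worth verifying on a small test directory first):

Code:
#!/bin/bash
# Remove files that hashdeep reported as matching a known "source" hash.
# Assumption: one absolute path per line in the log. Replace "rm" with
# "echo" on a first pass to get a dry run before deleting anything.
while IFS= read -r path; do
    [ -f "$path" ] && rm -v -- "$path"
done < messup_duplicates_to_remove.log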

And I am looking for any help from you to make this process more automatic, if that is possible...
 
02-22-2020, 01:10 AM   #2
syg00
LQ Veteran
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,152
There are a bunch of duplicate-removal tools. I've not looked at rmlint, but I did have it on my "to-do" list a while back; it looks very comprehensive. There is zero chance you will find a tool that does exactly what you want. Any tool that produces a list (rmlint, for example) should allow you to knock up a script to delete the lines you don't want. That way you can re-run it as often as needed. It has to be a (much) better option than what you proposed yourself.
I'm a big fan of getting a script or file list that I can check rather than allowing the tool to delete the files.
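Something along these lines, say (the file names here are made up for illustration, and the exact filtering depends on what list format your tool produces):

Code:
# Made-up file names; adapt the grep to whatever list your dedup tool emits.
grep '^/messup/' duplicates.list > to_delete.list    # keep only candidates under /messup
less to_delete.list                                  # eyeball the list before deleting
xargs -d '\n' rm -v -- < to_delete.list              # GNU xargs: one path per line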
 
03-23-2020, 10:52 AM   #3
uyjjhak
LQ Newbie
Registered: Feb 2020
Posts: 11
Original Poster
Quote:
Originally Posted by syg00
I'm a big fan of getting a script or file list that I can check rather than allowing the tool to delete the files.
Thank you for your answer, and sorry for taking so long to reply...

My current method is grepping out of rmlint.sh (the removal script rmlint generates) the lines that contain strings I need to keep safe. But when I am scanning thousands of files it is really hard to check whether everything was properly protected.

I have seen some piping and find examples in the rmlint man page, but I cannot get them working, as I do not have much experience with parsing those streams.

Maybe someone has used it before?
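
From the man page, the tagged-path options look like they might be what I need, though I have not managed to test this yet; a rough sketch (please correct the switches if they are wrong for the current version):

Code:
# Untested sketch based on the rmlint man page. Paths after "//" are
# "tagged", i.e. treated as the originals to keep.
#   --keep-all-tagged   : never mark anything under /source for removal
#   --must-match-tagged : only report a /messup file as a duplicate if a
#                         copy of it also exists under /source
rmlint /messup // /source --keep-all-tagged --must-match-tagged

# rmlint then writes rmlint.sh; review it before executing it.
less rmlint.sh
sh rmlint.sh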
 
  

