LinuxQuestions.org
Old 07-16-2004, 02:20 PM   #1
Bebo
Member
 
Registered: Jul 2003
Location: Göteborg
Distribution: Arch Linux (current)
Posts: 553

Rep: Reputation: 31
bash: routine outputting both matches and non-matches separately???


Hello,

I'm working on a script where I grep large files against each other, and where I need to use both the matches and the non-matches. Of course this can be done with two lines, like
Code:
grep -f one_file another_file > matches
grep -vf one_file another_file > non-matches
But, since the two files are pretty large, and since I do this kind of matching lots of times in a loop, this takes too long. Therefore it would be very nice if there actually was a command/magic pipe that did this, so that it would suffice to do something like
Code:
unknowncommand -f one_file another_file matches non-matches
I've done a ton of searching and come up with basically nothing. I found some strange use of tee here, but couldn't get the "process substitution" to work. (I'm probably just stupid... And what should it be, >:(commandline) or >(commandline)? I got neither to work.) One would also think that csplit would do the trick, but it seems to be no good for splitting a file the way I want.

It would be very easy to program something like this myself, but then my script wouldn't be portable at all. So, does anyone have an idea or even a solution to this problem? Please don't say that I have to do the mkfifo stuff mentioned on the site I linked to ;)
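For what it's worth, a minimal sketch of how the tee trick would look in bash: the working process-substitution syntax is >(command); the >:(command) form isn't valid. The filenames and the pattern "match" here are just placeholders:

```shell
# Split one pass of input into matches and non-matches using bash
# process substitution; "match" is a placeholder pattern.
printf 'one match\ntwo\nthree match\n' > longfile.txt

tee >(grep match > matches.txt) >(grep -v match > non-matches.txt) \
    < longfile.txt > /dev/null

sleep 1   # the substituted processes run asynchronously; let them flush
```

Note that bash does not wait for process substitutions to finish, so a short pause (or a later synchronization point) is needed before reading the output files.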

Thank you for your attention.

Cheers
 
Old 07-16-2004, 02:58 PM   #2
keefaz
LQ Guru
 
Registered: Mar 2004
Distribution: Slackware
Posts: 6,552

Rep: Reputation: 872
Could you explain more? As I understand it, the second grep is not needed, because the first grep already determines whether the patterns from one_file match the text in another_file.
 
Old 07-16-2004, 05:38 PM   #3
osvaldomarques
Member
 
Registered: Jul 2004
Location: Rio de Janeiro - Brazil
Distribution: Conectiva 10 - Conectiva 8 - Slackware 9 - starting with LFS
Posts: 519

Rep: Reputation: 34
Mr Bebo,

Do you want to get a list of files which match (or don't match) your regular expression, or do you want to obtain two files, one with the matching lines and the other with the non-matching ones?
In the first case, I would suggest something like
Code:
if [ "`grep -c -f reference_file target_file`" = "0" ]; then
   echo "target_file" >>dont_match_list
else
   echo "target_file" >> matches_list
fi
Otherwise, if you want the matching lines in one file and the remaining lines in another, I would recommend awk or perl. I am not fluent in perl, but it is possible there too. You open two output files and evaluate the regular expression in the language's logic; depending on the result, true or false, you print the line to one file or the other.
The "tee" command does not serve your needs, because its function is to archive intermediate results in another file; its name and purpose come from the plumbing fitting. Since after a grep you have only the matches (or only the non-matches), all you can store is one side of what you need.
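A minimal awk sketch of that idea, for illustration: one pass over the input, two output files opened from inside awk. The pattern /error/ and the filenames are placeholders, not from the thread:

```shell
# One pass, two output files, written from inside awk.
# /error/ is just an illustrative pattern.
printf 'error: disk\nok: net\nerror: mem\n' > input.txt

awk '/error/ { print > "matches.txt"; next }
             { print > "dont-match.txt" }' input.txt
```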
 
Old 07-16-2004, 06:29 PM   #4
Bebo
Member
 
Registered: Jul 2003
Location: Göteborg
Distribution: Arch Linux (current)
Posts: 553

Original Poster
Rep: Reputation: 31
osvaldomarques and keefaz,

Thanks for your swift answering of my call for aid! And sorry for being unclear; my choice of filenames was really bad, perfect for creating misunderstandings...

Never mind the two files for a while. Let's say I have one really long file, called longfile.txt. I want to grep out those lines containing a certain string, namely specialstring. This is of course an ordinary grep:
Code:
grep specialstring longfile.txt
OK, so far so good, but what if I also want to see the lines in longfile.txt that do not contain specialstring? The obvious solution is to just add -v to the grep. So, to see both the matching lines and the non-matching lines, I need two greps:
Code:
grep specialstring longfile.txt > matchinglines.txt
grep -v specialstring longfile.txt > notmatchinglines.txt
Now, let's say that I want to match the lines in longfile.txt against a list of special strings. If this list is called specialstrings.txt, then I could just do
Code:
grep -f specialstrings.txt longfile.txt
to get the matching lines, and of course the -v to get the not matching lines.

But it's not that simple: the files longfile.txt and specialstrings.txt are very long, and what's even worse is that I have many longfile.txt and specialstrings.txt files. I have to do it in a loop, like this:
Code:
for SPECIALSTRINGFILE in /blah/bleh/* ; do
   grep -f "$SPECIALSTRINGFILE" longfile.txt > matchinglines.txt
   grep -vf "$SPECIALSTRINGFILE" longfile.txt > notmatchinglines.txt

   ...do stuff with the matchinglines.txt and notmatchinglines.txt...
done
I can live with one grep - it's necessary - but it just takes too long with two grep lines!

...hence my question. And you're perfectly right about tee - it doesn't do me any good. I have actually started thinking of using awk or even sed, since I want this in a bash script, but I'm not that good at either, so I was hoping for some help on this.

Actually, I find it odd that there doesn't seem to be any command for this kind of problem. The closest one I've found is csplit, but it seems pretty useless here.

I hope I've made it clearer what I'm looking for, and excuse me for any obviousness And thanks again.
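In case it helps, here is one single-pass sketch where awk reads the pattern file itself. Note the assumption: patterns are treated as fixed substrings (like grep -F would), not as regular expressions, and all names here are placeholders:

```shell
# Single pass: the first file fills the pattern array, the second is
# split into two output files. Patterns are fixed substrings here.
printf 'foo\nbar\n' > specialstrings.txt
printf 'a foo line\nplain line\na bar line\n' > longfile.txt

awk 'NR == FNR { pat[$0]; next }     # first file: collect patterns
     {
       for (p in pat)
         if (index($0, p)) { print > "matchinglines.txt"; next }
       print > "notmatchinglines.txt"
     }' specialstrings.txt longfile.txt
```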
 
Old 07-16-2004, 09:53 PM   #5
osvaldomarques
Member
 
Registered: Jul 2004
Location: Rio de Janeiro - Brazil
Distribution: Conectiva 10 - Conectiva 8 - Slackware 9 - starting with LFS
Posts: 519

Rep: Reputation: 34
Several years ago I needed to take a big file (roughly 50k lines) and classify each line, based on its literal content, into an "accounting plan". I don't know the correct name in English; to give you an idea, suppose that if a line contains the word "tree" you have to give it the account "1.10.01", if it contains "hardware" the account "2.33.21", and so on. This was the time of the 386 with 8 MB of RAM. The script ran for 38 hours, but it did the job.
I understand your needs, but I don't know the time frame and how frequently you have to do this. If it is a once-in-a-lifetime job, general tools may solve your problem. If you have to do it every day, maybe it is best to write a C program.

Using general tools, maybe all you need is to create an output file with all the matches marked, sort it, and cut it into two files based on the mark. In this case, instead of a file containing the patterns to search for, you have a script into which you insert one substitution command per pattern. First we make some room at the start of each line, so we can be sure the mark is ours. For example,
Code:
#!/bin/sh
cat /var/log/messages | \
sed 's/^/ :/g' | \
sed 's/^\([^Y]\)\(:.*kernel.*\)$/Y\2/g' | \
sed 's/^\([^Y]\)\(:.*dhcpd.*\)$/Y\2/g' | \
tee subst.txt | \
sort | \
tee sort.txt | \
awk '
{
  if (substr($0, 1, 1) == "Y")
    printf("%s\n", substr($0, 3)) >"matches.txt" 
  else
    printf("%s\n", substr($0, 3)) >"dont-match.txt"
}'
I wrote this script to read /var/log/messages and treat lines containing "kernel" and "dhcpd" as matches, with the rest going to "dont-match". My log has 10k lines and it runs in a flash; if I had had this Athlon back then, life would not have been so adventurous.
In this example, all you need to do is insert more lines with the regular expressions you need. Note that sed looks for lines which do not start with Y before checking the rest of the expression. You construct your expression after the colon ":" and end it before the backslash-parenthesis. If an expression contains a slash (/), you can replace sed's delimiters (pardon my ignorance of the jargon) with any non-alphabetical character, for example %, @, =, +, etc. These delimiters occur after the "s", between the "$" and the "Y", and before the "g".
Your expression may be as complex as you need, but it must cover the whole line, because we substitute the entire matched expression in the output.

I hope this helps.

Last edited by osvaldomarques; 07-16-2004 at 09:56 PM.
 
Old 07-17-2004, 10:48 AM   #6
Bebo
Member
 
Registered: Jul 2003
Location: Göteborg
Distribution: Arch Linux (current)
Posts: 553

Original Poster
Rep: Reputation: 31
Aha, this is very good. I didn't think of putting a marker on the matching lines. I think that even csplit might be useful here, replacing the awk part, since csplit splits the file right at the first match, which makes it handy on sorted files. BTW, I think you can skip the "g" at the end of the sed lines: sed 's/blah/bleh/g'. The g makes sed replace all matching parts of a line, not just the first match; but since there is only one start of a line - i.e. one ^ - the g is superfluous.
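For the record, the csplit variant mentioned here could look like this sketch (sample data only; LC_ALL=C keeps the space-prefixed lines sorting before the Y-marked ones):

```shell
# After sorting, all Y-marked lines are contiguous, so one split at the
# first /^Y/ separates non-matches (xx00) from matches (xx01).
printf ' :bbb\n :aaa\nY:kkk\nY:zzz\n' | LC_ALL=C sort > sorted.txt
csplit -s sorted.txt '/^Y/'
```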

Great! Thanks!

EDIT: Oh, BTW, do you know how I can use this method with a file containing the interesting strings? Your reference_file in your first post. Without a loop...


Last edited by Bebo; 07-17-2004 at 11:17 AM.
 
Old 07-17-2004, 01:44 PM   #7
osvaldomarques
Member
 
Registered: Jul 2004
Location: Rio de Janeiro - Brazil
Distribution: Conectiva 10 - Conectiva 8 - Slackware 9 - starting with LFS
Posts: 519

Rep: Reputation: 34
Hi Bebo,
There is no omelet without breaking eggs. Again, I don't know the size of your data, the number of special strings, or the frequency of your processing. But we all know that a loop can be very inefficient. In my last post I suggested that instead of a simple string file you have a script with one "sed" per string. If your string set is variable, I would suggest a pre-processing phase which reads your string input and composes the script before submitting your file for matching.
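A sketch of that pre-processing phase, assuming the same " :" padding and "Y" mark as the earlier script (sed generating sed; the strings and input lines are placeholders):

```shell
# Generate one marking s-command per string, then run the generated
# script in a single sed pass over the padded input.
printf 'kernel\ndhcpd\n' > strings.txt

# Each string becomes:  s/^ :\(.*STRING.*\)$/Y:\1/
sed 's#.*#s/^ :\\(.*&.*\\)$/Y:\\1/#' strings.txt > mark.sed

printf 'kernel oops\nhello\ndhcpd lease\n' | sed 's/^/ :/' | sed -f mark.sed > marked.txt
```

The result can then be sorted and split on the Y mark exactly as in the earlier script.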
 
Old 07-18-2004, 08:39 PM   #8
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
I think this calls for a couple of sed scripts. Actually, it wouldn't be hard to use sed to produce the needed sed scripts.

Assuming that the strings you need to match don't contain wildcard characters like '.', '*', etc., this would produce the sed scripts from the string file:
sed 's#^\(.*\)$#/\1/p#' stringfile | uniq > matches.sed
sed 's#^\(.*\)$#/\1/d#' stringfile | uniq > nomatches.sed

After this you could produce your matched lines list like this:
sed -n -f matches.sed longfile.txt
and your unmatched lines list like this:
sed -f nomatches.sed longfile.txt
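A self-contained sketch of this two-script approach, with placeholder data; note that the non-match script must consist of /string/d commands, and the strings are assumed to be regex-safe:

```shell
# Generate /string/p and /string/d sed scripts from the string file,
# then apply each in a single pass over the data file.
printf 'foo\nbar\n' > stringfile
printf 'a foo\nplain\nbar here\n' > longfile.txt

sed 's#^\(.*\)$#/\1/p#' stringfile | uniq > matches.sed
sed 's#^\(.*\)$#/\1/d#' stringfile | uniq > nomatches.sed

sed -n -f matches.sed longfile.txt > matchinglines.txt
sed -f nomatches.sed longfile.txt > notmatchinglines.txt
```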
 
Old 07-19-2004, 06:52 AM   #9
Bebo
Member
 
Registered: Jul 2003
Location: Göteborg
Distribution: Arch Linux (current)
Posts: 553

Original Poster
Rep: Reputation: 31
Thanks a lot guys for your help. I think I have some testing to do now - we'll see what will be fastest
 
  

