LinuxQuestions.org
Old 07-16-2004, 02:20 PM   #1
Bebo
Member
 
Registered: Jul 2003
Location: Göteborg
Distribution: Arch Linux (current)
Posts: 553

Rep: Reputation: 31
bash: routine outputting both matches and non-matches separately???


Hello,

I'm working on a script where I grep large files against each other, and where I need to use both the matches and the non-matches. Of course this can be done with two lines, like
Code:
grep -f one_file another_file > matches
grep -vf one_file another_file > non-matches
But, since the two files are pretty large, and since I do this kind of matching lots of times in a loop, this takes too long. Therefore it would be very nice if there actually was a command/magic pipe that did this, so that it would suffice to do something like
Code:
unknowncommand -f one_file another_file matches non-matches
I've done a ton of searching and come up with basically nothing. I found some strange use of tee here, but couldn't get the "process substitution" to work. (I'm probably just stupid... And what should it be, >:(commandline) or >(commandline)? I got neither to work.) One would also think that csplit would do the trick, but it seems to be no good for splitting a file the way I want.

It would be very easy to program something like this myself, but then my script wouldn't be portable at all. So, does anyone have an idea or even a solution to this problem? Please don't say that I have to do the mkfifo stuff mentioned on the site I linked to ;)
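For what it's worth, a minimal sketch of how the tee trick would look in bash: the working process-substitution syntax is >(command); the >:(command) form isn't valid. The filenames and the pattern "match" here are just placeholders:

```shell
# Split one pass of input into matches and non-matches using bash
# process substitution; "match" is a placeholder pattern.
printf 'one match\ntwo\nthree match\n' > longfile.txt

tee >(grep match > matches.txt) >(grep -v match > non-matches.txt) \
    < longfile.txt > /dev/null

sleep 1   # the substituted processes run asynchronously; let them flush
```

Note that bash does not wait for process substitutions to finish, so a short pause (or a later synchronization point) is needed before reading the output files.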

Thank you for your attention.

Cheers
 
Old 07-16-2004, 02:58 PM   #2
keefaz
LQ Guru
 
Registered: Mar 2004
Distribution: Slackware
Posts: 6,552

Rep: Reputation: 872
Could you explain more? As I understand it, the second grep is not needed, because the first grep already determines whether the patterns from one_file match the text in another_file.
 
Old 07-16-2004, 05:38 PM   #3
osvaldomarques
Member
 
Registered: Jul 2004
Location: Rio de Janeiro - Brazil
Distribution: Conectiva 10 - Conectiva 8 - Slackware 9 - starting with LFS
Posts: 519

Rep: Reputation: 34
Mr Bebo,

Do you want to get a list of files which match (or don't match) your regular expression, or do you want to obtain two files, one with the matching lines and the other with the non-matching ones?
In the first case, I would suggest something like
Code:
if [ "`grep -c -f reference_file target_file`" = "0" ]; then
   echo "target_file" >>dont_match_list
else
   echo "target_file" >> matches_list
fi
Otherwise, if you want the matching lines in one file and the remaining lines in another, I would recommend awk or perl. I am not fluent in perl, but it is possible there too. You open two output files and evaluate the regular expression in the language's logic; depending on the result, true or false, you print the line to one file or the other.
The "tee" command does not serve your needs, because its function is to archive intermediate results in another file; its name and purpose come from the plumbing fitting. Since after a grep you have only the matches (or only the non-matches), all you can store is one side of what you need.
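A minimal awk sketch of that idea, for illustration: one pass over the input, two output files opened from inside awk. The pattern /error/ and the filenames are placeholders, not from the thread:

```shell
# One pass, two output files, written from inside awk.
# /error/ is just an illustrative pattern.
printf 'error: disk\nok: net\nerror: mem\n' > input.txt

awk '/error/ { print > "matches.txt"; next }
             { print > "dont-match.txt" }' input.txt
```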
 
Old 07-16-2004, 06:29 PM   #4
Bebo
Member
 
Registered: Jul 2003
Location: Göteborg
Distribution: Arch Linux (current)
Posts: 553

Original Poster
Rep: Reputation: 31
osvaldomarques and keefaz,

Thanks for your swift answering of my call for aid! And sorry for being unclear; my choice of filenames was really bad, perfect for creating misunderstandings...

Never mind the two files for a while. Let's say I have one really long file, called longfile.txt. I want to grep out those lines containing a certain string, namely specialstring. This is of course an ordinary grep:
Code:
grep specialstring longfile.txt
OK, so far so good, but what if I also want to see the lines in longfile.txt that do not contain specialstring? The obvious solution is to just add -v to the grep. So, to see both the matching lines and the non-matching lines, I need two greps:
Code:
grep specialstring longfile.txt > matchinglines.txt
grep -v specialstring longfile.txt > notmatchinglines.txt
Now, let's say that I want to match the lines in longfile.txt against a list of special strings. If this list is called specialstrings.txt, then I could just do
Code:
grep -f specialstrings.txt longfile.txt
to get the matching lines, and of course the -v to get the not matching lines.

But it's not that simple: the files longfile.txt and specialstrings.txt are very long, and what's even worse is that I have many longfile.txt and specialstrings.txt files. I have to do it in a loop, like this:
Code:
for SPECIALSTRINGFILE in /blah/bleh/* ; do
   grep -f "$SPECIALSTRINGFILE" longfile.txt > matchinglines.txt
   grep -vf "$SPECIALSTRINGFILE" longfile.txt > notmatchinglines.txt

   ...do stuff with the matchinglines.txt and notmatchinglines.txt...
done
I can live with one grep - it's necessary - but it just takes too long with two grep lines!

...hence my question. And you're perfectly right about tee - it doesn't do me any good. I have actually started thinking of using awk or even sed, since I want this in a bash script, but I'm not that good at either, so I was hoping for some help on this.

Actually, I find it odd that there doesn't seem to be any command for this kind of problem. The closest one I've found is csplit, but it seems pretty useless here.

I hope I've made it clearer what I'm looking for, and excuse me for any obviousness And thanks again.
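In case it helps, here is one single-pass sketch where awk reads the pattern file itself. Note the assumption: patterns are treated as fixed substrings (like grep -F would), not as regular expressions, and all names here are placeholders:

```shell
# Single pass: the first file fills the pattern array, the second is
# split into two output files. Patterns are fixed substrings here.
printf 'foo\nbar\n' > specialstrings.txt
printf 'a foo line\nplain line\na bar line\n' > longfile.txt

awk 'NR == FNR { pat[$0]; next }     # first file: collect patterns
     {
       for (p in pat)
         if (index($0, p)) { print > "matchinglines.txt"; next }
       print > "notmatchinglines.txt"
     }' specialstrings.txt longfile.txt
```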
 
Old 07-16-2004, 09:53 PM   #5
osvaldomarques
Member
 
Registered: Jul 2004
Location: Rio de Janeiro - Brazil
Distribution: Conectiva 10 - Conectiva 8 - Slackware 9 - starting with LFS
Posts: 519

Rep: Reputation: 34
Several years ago I needed to take a big file (roughly 50k lines) and classify each line, based on its literal content, into an "accounting plan". I don't know the correct name in English; to give you an idea, suppose that if a line contains the word "tree" you have to give it the account "1.10.01", if it contains "hardware" the account "2.33.21", and so on. This was the time of the 386 with 8 MB of RAM. The script ran for 38 hours, but it did the job.
I understand your needs, but I don't know the time frame and how frequently you have to do this. If it is a once-in-a-lifetime job, general tools may solve your problem. If you have to do it every day, maybe it is best to write a C program.

Using general tools, maybe all you need is to create an output file with all the matches marked, sort it, and cut it into two files based on the mark. In this case, instead of a file containing the patterns to search for, you have a script into which you insert one substitution command per pattern. First we make some room at the start of each line, so we can be sure the mark is ours. For example,
Code:
#!/bin/sh
cat /var/log/messages | \
sed 's/^/ :/g' | \
sed 's/^\([^Y]\)\(:.*kernel.*\)$/Y\2/g' | \
sed 's/^\([^Y]\)\(:.*dhcpd.*\)$/Y\2/g' | \
tee subst.txt | \
sort | \
tee sort.txt | \
awk '
{
  if (substr($0, 1, 1) == "Y")
    printf("%s\n", substr($0, 3)) >"matches.txt" 
  else
    printf("%s\n", substr($0, 3)) >"dont-match.txt"
}'
I wrote this script to read /var/log/messages and treat lines containing "kernel" and "dhcpd" as matches, with the rest going to "dont-match". My log has 10k lines and it runs in a flash; if I had had this Athlon back then, life would not have been so adventurous.
In this example, all you need to do is insert more lines with the regular expressions you need. Note that sed looks for lines which do not start with Y before checking the rest of the expression. You construct your expression after the colon ":" and end it before the backslash-parenthesis. If an expression contains a slash (/), you can replace sed's delimiters (pardon my ignorance of the jargon) with any non-alphabetical character, for example %, @, =, +, etc. These delimiters occur after the "s", between the "$" and the "Y", and before the "g".
Your expression may be as complex as you need, but it must cover the whole line, because we substitute the entire matched expression in the output.

I hope this helps.

Last edited by osvaldomarques; 07-16-2004 at 09:56 PM.
 
Old 07-17-2004, 10:48 AM   #6
Bebo
Member
 
Registered: Jul 2003
Location: Göteborg
Distribution: Arch Linux (current)
Posts: 553

Original Poster
Rep: Reputation: 31
Aha, this is very good. I didn't think of putting a marker on the matching lines. I think that even csplit might be useful here, replacing the awk part, since csplit splits the file right at the first match, which makes it handy on sorted files. BTW, I think you can skip the "g" at the end of the sed lines: sed 's/blah/bleh/g'. The g makes sed replace all matching parts of a line, not just the first match; but since there is only one start of a line - i.e. one ^ - the g is superfluous.
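For the record, the csplit variant mentioned here could look like this sketch (sample data only; LC_ALL=C keeps the space-prefixed lines sorting before the Y-marked ones):

```shell
# After sorting, all Y-marked lines are contiguous, so one split at the
# first /^Y/ separates non-matches (xx00) from matches (xx01).
printf ' :bbb\n :aaa\nY:kkk\nY:zzz\n' | LC_ALL=C sort > sorted.txt
csplit -s sorted.txt '/^Y/'
```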

Great! Thanks!

EDIT: Oh, BTW, do you know how I can use this method with a file containing the interesting strings? Your reference_file in your first post. Without a loop...


Last edited by Bebo; 07-17-2004 at 11:17 AM.
 
Old 07-17-2004, 01:44 PM   #7
osvaldomarques
Member
 
Registered: Jul 2004
Location: Rio de Janeiro - Brazil
Distribution: Conectiva 10 - Conectiva 8 - Slackware 9 - starting with LFS
Posts: 519

Rep: Reputation: 34
Hi Bebo,
There is no omelet without breaking eggs. Again, I don't know the size of your data, the number of special strings, or the frequency of your processing. But we all know that a loop can be very inefficient. In my last post I suggested that instead of a simple string file you have a script with one "sed" per string. If your string set is variable, I would suggest a pre-processing phase which reads your string input and composes the script before submitting your file for matching.
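A sketch of that pre-processing phase, assuming the same " :" padding and "Y" mark as the earlier script (sed generating sed; the strings and input lines are placeholders):

```shell
# Generate one marking s-command per string, then run the generated
# script in a single sed pass over the padded input.
printf 'kernel\ndhcpd\n' > strings.txt

# Each string becomes:  s/^ :\(.*STRING.*\)$/Y:\1/
sed 's#.*#s/^ :\\(.*&.*\\)$/Y:\\1/#' strings.txt > mark.sed

printf 'kernel oops\nhello\ndhcpd lease\n' | sed 's/^/ :/' | sed -f mark.sed > marked.txt
```

The result can then be sorted and split on the Y mark exactly as in the earlier script.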
 
Old 07-18-2004, 08:39 PM   #8
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
I think this calls for a couple of sed scripts. Actually, it wouldn't be hard to use sed to produce the needed sed scripts.

Assuming that the strings you need to match don't contain wildcard characters like '.', '*', etc., this would produce the sed scripts from the string file:
sed 's#^\(.*\)$#/\1/p#' stringfile | uniq > matches.sed
sed 's#^\(.*\)$#/\1/d#' stringfile | uniq > nomatches.sed

After this you could produce your matched lines list like this:
sed -n -f matches.sed longfile.txt
and your unmatched lines list like this:
sed -f nomatches.sed longfile.txt
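A self-contained sketch of this two-script approach, with placeholder data; note that the non-match script must consist of /string/d commands, and the strings are assumed to be regex-safe:

```shell
# Generate /string/p and /string/d sed scripts from the string file,
# then apply each in a single pass over the data file.
printf 'foo\nbar\n' > stringfile
printf 'a foo\nplain\nbar here\n' > longfile.txt

sed 's#^\(.*\)$#/\1/p#' stringfile | uniq > matches.sed
sed 's#^\(.*\)$#/\1/d#' stringfile | uniq > nomatches.sed

sed -n -f matches.sed longfile.txt > matchinglines.txt
sed -f nomatches.sed longfile.txt > notmatchinglines.txt
```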
 
Old 07-19-2004, 06:52 AM   #9
Bebo
Member
 
Registered: Jul 2003
Location: Göteborg
Distribution: Arch Linux (current)
Posts: 553

Original Poster
Rep: Reputation: 31
Thanks a lot guys for your help. I think I have some testing to do now - we'll see what will be fastest
 
  

