Bash

LocoMojo · 01-24-2011, 08:45 PM

Hello all,

It's been a very long time since I've mucked around in Bash or Python so I've pretty much forgotten most of it. I've run into a problem I need to resolve at work. I can do it by hand, but it would take me hours upon hours to do it. I'd like to let the computer do the work for me, if at all possible...I'm just not sure how.

You see, I have two text files, "all.txt" and "address.txt". In the "all.txt" file I have email addresses, first name, and last name (approximately 10,000 lines) like so:

someone@somewhere.net John Doe
someoneelse@somewhere.net Jane Doe

In the "address.txt" file, I have email addresses only like so:

someone@somewhere.net
someoneelse@somewhere.net

I need to write a script that will read each line of the "address.txt" file, find its corresponding match in the "all.txt" file, and then print the whole line (email address, first name, and last name) into a file called "matched.txt". If a line in "address.txt" fails to match a line in "all.txt", then I need it to be printed to a file called "no-match.txt".

Hope this makes sense.

What is the best way to go about this, speed, resource, and accuracy wise?

I tried a few things in Bash and Python, but it isn't working out well. I'm back to being a newbie again.

Any help or advice would be sincerely appreciated!

Thanks.

LocoMojo

LocoMojo · 01-24-2011, 09:00 PM

As soon as I posted the OP, it dawned on me.

I'm so embarrassed, I forgot about grep.

Thanks anyway.

LocoMojo

ghostdog74 · 01-24-2011, 09:01 PM

Python script

Code:

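#!/usr/bin/env python
from collections import defaultdict
h = defaultdict(str)
# every whitespace-separated token in address.txt is treated as an address
addr = open("address.txt").read().split()
# map each email address (first field) to its full line from all.txt
for line in open("all.txt"):
    s=line.rstrip().split(" ",1)
    h[s[0]] = line

keys = h.keys()
# note: "and" evaluates to set(keys) whenever set(addr) is non-empty;
# the set intersection would be written set(addr) & set(keys)
same = set(addr) and set(keys)
diff = set(addr) - set(keys)
match = open("matched.txt","w")
for found in same:
    match.write( h[found] )
match.close()

nomatch = open("no-match.txt","w")
for no in diff:
    # note: no "\n" is appended, so the addresses run together on one line
    nomatch.write(no)
nomatch.close()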

LocoMojo · 01-24-2011, 09:53 PM

Hello ghostdog74,

I came back because I found that my bash script didn't actually work 100%.

I saw your post and got excited and tried it out. It was so much faster than mine, but unfortunately it didn't work. I tried it with sample files.

all.txt = 3,102 lines
address.txt = 906 lines

After using your script:

matched.txt = 3,102 lines
no-match.txt = 1 line with many addresses (no new lines)

I skimmed over the files and counted at least 25 "no matches" so the matched.txt file should not equal the number of lines in all.txt.

Thanks though!

My bash script was far less elegant, but it almost worked:

Code:

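#!/bin/bash

FILE1=address.txt
FILE2=all.txt

# runs grep once per address; read skips a final line that has no trailing newline
while read line; do
  # $line is unquoted and used as a regular expression, so it can also match
  # inside longer addresses; grep prints the matching line to the terminal
  if grep $line $FILE2; then
    # only the address is saved here, not the full line from all.txt
    echo $line >> matched.txt
  else
    echo $line >> no-matches.txt
  fi
done < $FILE1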

With this script I got:

890 matches
15 no matches

A total of 905 out of 906 lines in address.txt ... strange.

I'll have to fiddle more with this. I like your script though; it was much faster and probably lighter on resources, but it was inaccurate.

Thanks again!

LocoMojo
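
Note on the 905 out of 906 count: one possible cause (an assumption, the thread never confirms it) is that the last line of address.txt has no trailing newline, in which case a bash "while read line" loop does not run its body for that line. Blank or duplicate lines can also make hand counts disagree. A minimal Python sketch of a checker follows; the helper name inspect_address_file is hypothetical and not part of any script posted here.

```python
from collections import Counter


def inspect_address_file(path):
    """Report properties of a one-address-per-line file that can skew line counts."""
    # newline="" keeps the original line endings so the last character can be checked.
    with open(path, newline="") as handle:
        text = handle.read()
    lines = text.splitlines()
    stripped = [line.strip() for line in lines]
    counts = Counter(s for s in stripped if s)
    return {
        "lines": len(lines),
        "blank_lines": sum(1 for s in stripped if not s),
        "duplicates": sorted(a for a, n in counts.items() if n > 1),
        "ends_with_newline": text.endswith("\n"),
    }
```

Calling inspect_address_file("address.txt") returns a small dictionary; if "ends_with_newline" is False, the bash loop above would process one line fewer than the file contains.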

LocoMojo · 01-24-2011, 10:08 PM

In the above post:

"I skimmed over the files and counted at least 25 "no matches" " should have read "I skimmed over the files and counted at least 5 "no matches" ".

It doesn't matter anyway; matches should not exceed 906 (the number of lines being checked against "all.txt", which has 3,102 lines).

LocoMojo
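
Note on the 3,102-line matched.txt: a likely explanation is the difference between Python's boolean "and" and the set intersection operator "&". With two non-empty sets, "a and b" simply evaluates to b, so every key from all.txt would be treated as a match. The sketch below uses made-up addresses (not taken from the real files) to show the difference.

```python
# Made-up sample data; the addresses are illustrative only.
addr = {"someone@somewhere.net", "missing@somewhere.net"}
keys = {"someone@somewhere.net", "someoneelse@somewhere.net", "third@somewhere.net"}

with_and = addr and keys      # boolean "and": evaluates to keys, since addr is non-empty
with_amp = addr & keys        # set intersection: only addresses present in both sets

print(len(with_and))          # 3
print(sorted(with_amp))       # ['someone@somewhere.net']
print(sorted(addr - keys))    # ['missing@somewhere.net']
```

On a two-line sample where every address appears in both files, the two expressions give the same result, which may be why the problem only shows up on larger data.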

ghostdog74 · 01-24-2011, 10:33 PM

Quote:

Originally Posted by LocoMojo

Hello ghostdog74,

I came back because I found that my bash script didn't actually work 100%.

I saw your post and got excited and tried it out. It was so much faster than mine, but unfortunately it didn't work. I tried it with s

You only provided a small sample of each file to work with, and my code does work with it.
Why don't you provide more samples of both files? Do they all have the same structure? Show your expected output too, if possible. It's much faster than your bash script because yours needs to call grep for EACH line of address.txt, and every call scans all of all.txt (roughly O(n^2)).
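
For comparison, here is a minimal sketch of the single-pass dictionary approach with exact matching on the address. It assumes the first whitespace-separated field of each all.txt line is the email address, that address.txt holds one address per line, and that matching is case-sensitive; the function name split_matches is hypothetical. If an address occurs more than once in all.txt, the last occurrence wins.

```python
def split_matches(all_path, address_path, matched_path, nomatch_path):
    """Write full all.txt lines for known addresses and unknown addresses separately."""
    # Index all.txt by its first whitespace-separated field (the email address).
    by_email = {}
    with open(all_path) as all_file:
        for line in all_file:
            fields = line.split(None, 1)
            if fields:                       # skip blank lines
                by_email[fields[0]] = line.rstrip("\n")

    matched = unmatched = 0
    with open(address_path) as addresses, \
         open(matched_path, "w") as match_out, \
         open(nomatch_path, "w") as nomatch_out:
        for line in addresses:
            email = line.strip()
            if not email:                    # ignore blank lines in address.txt
                continue
            if email in by_email:            # exact, whole-address comparison
                match_out.write(by_email[email] + "\n")
                matched += 1
            else:
                nomatch_out.write(email + "\n")
                unmatched += 1
    return matched, unmatched
```

A call such as split_matches("all.txt", "address.txt", "matched.txt", "no-match.txt") returns the two counts, and they should add up to the number of non-blank lines in address.txt. Each dictionary lookup takes roughly constant time, so the work grows with the combined size of the two files rather than with their product.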

grail · 01-25-2011, 01:56 AM

Well not a full solution but a quick way to get the first half would be:

Code:

grep -f address.txt all.txt > matched.txt

grail · 01-25-2011, 03:41 AM

As an addition, if you threw this in a bash script you could do the following:

Code:

#!/bin/bash

grep -f address.txt all.txt > matched.txt

awk 'FNR==NR{arr[$1]++;next}!($1 in arr)' matched.txt address.txt > not_matched.txt

Not tested, and I'm not sure of the performance hit, but I think it should work.
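
One thing to watch with grep -f: each line of address.txt is used as a regular expression and may match anywhere in a line of all.txt, so a short address can also match a longer address that contains it. The Python sketch below imitates that behaviour with made-up data. It assumes re.search is a fair stand-in for grep's matching; the two regular expression dialects differ in details.

```python
import re

# Made-up sample data; the names and addresses are illustrative only.
all_lines = [
    "ann@somewhere.net Ann Smith",
    "joann@somewhere.net Jo Ann Jones",
]
address = "ann@somewhere.net"

# Pattern search anywhere in the line, roughly what a plain grep pattern does.
loose = [line for line in all_lines if re.search(address, line)]

# Compare the first field exactly, as a dictionary or awk field lookup would.
exact = [line for line in all_lines if line.split()[0] == address]

print(len(loose))   # 2
print(len(exact))   # 1
```

With grep itself, the -F (fixed strings) and -w (whole word) options reduce such false matches, although an exact comparison on the first field is stricter.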

Reuti · 01-25-2011, 04:45 AM

There is also the join utility, often installed as part of the GNU text tools, which joins the lines of two files on a common field (both files need to be sorted on that field). Other useful text tools are presented here: GNU text utilities.