text match pipe to file then delete from original text file create new dir automatic

tr1px · 09-09-2008, 12:08 PM

I have a huge file of about 3 million records with email data. So far I use:

cat split_?.txt |grep -i '\<domain.com' >> ./domain/domain.txt

This takes emails matching the domain and puts them in a file for that domain.

Now I need to delete the records that went into domain.txt from the original files. My ultimate goal is to automate the whole process with a shell script, which I am learning to write right now.

I want to take file1.txt, which holds the email records, and have a script go through it and look at all the domains in there. The script is supposed to create a folder named after each domain and put the matching records in a file in that folder. After that I want to delete those rows from file1.txt.

file1 -> Look at domain -> create Folder for domain -> put record in new file in domain folder -> delete record from file1

I hope this is not too complicated...

For right now help with

cat split_?.txt |grep -i '\<domain.com' >> ./domain/domain.txt ----> and then delete the record from split_?.txt would be ok.

Thank you in advance.

arckane · 09-09-2008, 04:20 PM

Looks like a sed or awk approach I think... hmmm, let me think.

Just to make sure I get this right:

You'll end up with multiple files called /domain/domain_name.txt, so in theory you could go through the directory of files, pull out any of the domain_name sections, search for matching ones and remove?

If that's the case then something like:

Code:

for I in ./domain/*.txt; do T=${I##./domain/}; I=${T%.txt}; sed -n "/$I"'/!p' ./file.txt ; done

Once you're happy that you get the correct output, change the sed line to "sed -i -n ...". The -i (GNU sed) edits file.txt in place instead of printing the whole file, minus the matching lines, on every pass of the loop.

TEST THIS FIRST, don't take it as set in stone. I've quickly played with it and it works for me...
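
For the narrower "grep one domain out, then drop those rows from the split files" request, here is a minimal sketch rather than anything tested at scale. The function name extract_domain is made up for illustration, and it assumes each email sits in a quoted field, so the domain is always followed by a closing double quote. It writes to ./DOMAIN/DOMAIN.txt as in the original grep line.

```shell
#!/bin/sh
# extract_domain DOMAIN FILE...
# Hypothetical helper: appends records whose email ends in @DOMAIN to
# ./DOMAIN/DOMAIN.txt, then rewrites each FILE without those records.
extract_domain() {
    domain=$1; shift
    # Escape the dots so "domain.com" cannot also match "domainXcom".
    # Assumption: the email is a quoted field, hence the trailing quote.
    pattern="@$(printf '%s' "$domain" | sed 's/\./\\./g')\""
    mkdir -p "./$domain" || return 1
    for f in "$@"; do
        # Matching rows are appended to the per-domain file.
        grep -i -- "$pattern" "$f" >> "./$domain/$domain.txt"
        # All other rows go to a temp file that then replaces the original.
        grep -iv -- "$pattern" "$f" > "$f.tmp"
        mv "$f.tmp" "$f"
    done
}
```

It would be called as extract_domain domain.com split_*.txt. Note that split_?.txt only matches a single character after the underscore, so split_10.txt through split_13.txt would be skipped by that glob. Run it on copies first, since it overwrites the input files.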

chrism01 · 09-09-2008, 06:33 PM

There are enough operations involved that I'd recommend writing a proper shell script instead of a one-liner. It's much easier to debug, and, for example, you need to check whether each directory exists before creating it.

http://tldp.org/LDP/Bash-Beginners-G...tml/index.html
http://www.tldp.org/LDP/abs/html/
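
As a small illustration of the "check before creating" step, a sketch along these lines might do. ensure_domain_dir and its two arguments are hypothetical names, not anything from the guides above. mkdir -p already succeeds quietly when the directory exists, so the explicit test is mostly there to make the logic visible.

```shell
#!/bin/sh
# ensure_domain_dir BASE DOMAIN
# Hypothetical helper: creates BASE/DOMAIN if it is missing and prints
# the path of the per-domain record file inside it.
ensure_domain_dir() {
    dir="$1/$2"
    if [ ! -d "$dir" ]; then
        mkdir -p "$dir" || return 1
    fi
    printf '%s\n' "$dir/$2.txt"
}
```

A caller could then do something like out=$(ensure_domain_dir /somedir domain.com) and append records to "$out".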

tr1px · 09-10-2008, 02:03 PM

Let me lay this out a little more clearly.

I have a directory that contains around 37 million email records with fname lname addr email ... split into 13 files. split_1.txt, split_2.txt ... through split_13.txt. I now need a script that can read through file by file and look for domains. (ex. mike@domain.com [mike]@[domain.com]). The file it is going through looks similar to this:

"mike","dawson","23 kimber lane","hollywood","FL","33020","5553211234","mike@domain.com","blah","blah-blah"

I want a script that looks at each record, pulls out [domain.com], and checks whether the folder /somedir/domain.com exists. If it does not, create it. Then create the file /somedir/domain.com/domain.com.txt if it is not there yet and append the record to it. Once that is finished I would like the record to be deleted from whichever split_?.txt file it came from. The deleting part is not that important; it is only to save disk space. If someone can help me out with this I will love you forever. I have been doing this more or less manually, and it takes hours to pull out the known domains one at a time and wait.

I know how to create loops and I am close to having the answer but seem a bit clueless.

chrism01 · 09-10-2008, 06:58 PM

Ok, that's much clearer.

1. there are many ways to do it (I'd use Perl), but if you've nearly got the answer, how about posting it along with what's wrong and we can help you fix it.
2. When it's done I'd gzip the orig files and back them up for ref
3. IIUC, you want a dir for every domain and a file for every user. If you have 37M users you may run out of inodes before disk space (use df -i to check)
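
To get a feel for point 3 before running anything, one option is to count how many distinct domains there are, since each one costs at least a directory. This is a rough sketch that assumes the email is always the 8th comma-separated field and that no earlier field contains a comma. count_domains is a made-up name.

```shell
#!/bin/sh
# count_domains FILE...
# Prints the number of distinct (case-folded) email domains in the files.
count_domains() {
    cut -d',' -f8 "$@" |     # "user@domain.com"
        cut -d'@' -f2 |      # domain.com"
        tr -d '"' |          # domain.com
        tr 'A-Z' 'a-z' |     # fold case so Domain.com and domain.com are one
        sort -u |
        wc -l
}
```

The result from count_domains split_*.txt can then be compared with the free inode count that df -i reports for the target filesystem.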

HTH

tr1px · 09-10-2008, 08:18 PM

No, I want a file per domain with all the email records from that domain in one file. ex:

"john","dude","2334 kimber lane","hollywood","FL","33020","5553211234","john@domain.com","blah","blah-blah"
"mike","dawson","23121 kimber lane","hollywood","FL","33020","5553211234","mike@domain.com","blah","blah-blah"
"paul","walker","2346 kimber lane","hollywood","FL","33020","5553211234","paul@domain.com","blah","blah-blah"
"jody","jane","2334 kimber lane","hollywood","FL","33020","5553211234","jody@domain.com","blah","blah-blah"
"tim","stuart","2334 kimber lane","hollywood","FL","33020","5553211234","tim@domain.com","blah","blah-blah"
"mike","jones","23566 kimber lane","hollywood","FL","33020","5553211234","mike@domain.com","blah","blah-blah"

Let's say the above records are all different people, but their email addresses are from the same domain. They all belong in
/domain.com/domain.com.txt

Let's say they all use hotmail instead; they would then go in
/hotmail.com/hotmail.com.txt

Now, when I said I was almost there, I meant I can do all this manually with
cat split_?.txt |grep -i '\<domain.com' >> ./domain.com/domain.com.txt

chrism01 · 09-10-2008, 09:40 PM

Copied your data into a file t.t and ran this

Code:

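#Read t.t one record at a time; IFS= and -r keep each line exactly as is
while IFS= read -r rec
do
    #Field 8 is the email; this breaks if an earlier field contains a comma
    user_dom=`echo "$rec"|cut -d',' -f8`
    echo "$user_dom"   #debug = "user@domain.com"
    domain=`echo "$user_dom"|cut -d'@' -f2|cut -d'"' -f1`
    echo "$domain" #debug = domain.com

    #Create the domain dir if it is missing, then add to file
    mkdir -p "tmp/${domain}"
    echo "$rec" >>"tmp/${domain}/${domain}.txt"
done < t.t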

HTH
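
With tens of millions of records, starting several cut processes per line will be slow, so a single awk pass may be worth trying. What follows is only a sketch under assumptions: every field is double-quoted and separated by "," exactly as in the sample, the email is field 8, and the output directory is a plain path without quotes or dollar signs. The name split_by_domain is invented for illustration.

```shell
#!/bin/sh
# split_by_domain OUTDIR FILE...
# Hypothetical helper: appends each record to OUTDIR/domain/domain.txt.
split_by_domain() {
    outdir=$1; shift
    awk -F'","' -v outdir="$outdir" '
    {
        n = split($8, parts, "@")            # field 8 is user@domain.com
        if (n != 2) next                     # skip rows without a usable email
        domain = tolower(parts[2])
        if (domain !~ /^[a-z0-9.-]+$/) next  # refuse odd names before using them in a path
        dir = outdir "/" domain
        if (!(domain in seen)) {             # first time we see this domain
            system("mkdir -p \"" dir "\"")
            seen[domain] = 1
        }
        out = dir "/" domain ".txt"
        print >> out
        close(out)                           # keep the number of open files small
    }' "$@"
}
```

For example, split_by_domain /somedir split_*.txt. Rows that fail the email check are skipped silently, so it is worth comparing line counts before gzipping or deleting the original split files.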