deleting lines from a file with specific pattern using AWK

gandhigaurav1986 · 06-06-2010, 12:09 PM

Hi,

I have a file that contains millions of records. Each record has 12 columns separated by "||" (the delimiter).

First two fields contain first name and last name of a person. Now my requirement is to delete all those records from this file for which:

The first two fields do not contain any alphabetic characters.

For example, I have the records below in the file:

gaurav||gandhi||123||456||789
#a%bcd||123abc||89|90||91
12345||@@@||89||123||234
***||!!!!||98||76||90

Now, the last two lines should be removed from this file, since the first two fields of these two records do not contain any alphabetic characters.
Please help me out with this.

colucix · 06-06-2010, 12:25 PM

Hi and welcome to LinuxQuestions! If the other fields do not contain alphabetic characters, as in your example, you can simply do:

Code:

awk '/[a-zA-Z]/' file

or using sed:

Code:

sed '/[a-zA-Z]/!d' file

otherwise you should match the two fields specifically, for example by means of something like:

Code:

awk -F"|" '$1 ~ /[a-zA-Z]/ && $3 ~ /[a-zA-Z]/' file
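A small sketch of why the whole-line match is only safe when the other fields have no letters. The record below is hypothetical (it is not from the original post): its first two fields are purely numeric, but a later field contains letters.

```shell
# Hypothetical record: no letters in fields 1 and 2, letters in field 3.
record='123||456||abc'

# Whole-line match: any letter anywhere keeps the record.
whole=$(printf '%s\n' "$record" | awk '/[a-zA-Z]/')

# Field-specific match with a single pipe as separator:
# $1 is the first field, $3 is the second one.
byfield=$(printf '%s\n' "$record" |
  awk -F'|' '$1 ~ /[a-zA-Z]/ && $3 ~ /[a-zA-Z]/')

printf 'whole-line match kept: [%s]\n' "$whole"
printf 'field match kept:      [%s]\n' "$byfield"
```

The whole-line version keeps the record because of the letters in the third field, while the field-specific version drops it.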

Hope this helps.

grail · 06-06-2010, 08:53 PM

Slight adjustment to colucix's last entry, as the delimiter is two pipes (and in case you weren't aware, you will need to redirect to a new file):

Code:

awk -F"||" '$1 ~ /[a-zA-Z]/ && $3 ~ /[a-zA-Z]/' file > new_file

syg00 · 06-06-2010, 10:28 PM

Does that work? And if it does, wouldn't that be $2?

grail · 06-07-2010, 12:51 AM

Quote:

Does that work? And if it does, wouldn't that be $2?

Seems in my haste I should have done a little testing:

Code:

awk -F"[|][|]" '$1 ~ /[a-zA-Z]/ && $2 ~ /[a-zA-Z]/' file > new_file
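As a quick check, here is a sketch that runs this command on the four sample records from the first post. Holding the records in a shell variable (named `records` here) instead of a file is only a convenience for the example.

```shell
# Sample records from the original post, held in a shell variable.
records='gaurav||gandhi||123||456||789
#a%bcd||123abc||89|90||91
12345||@@@||89||123||234
***||!!!!||98||76||90'

# Keep a record only if field 1 and field 2 each contain a letter.
kept=$(printf '%s\n' "$records" |
  awk -F'[|][|]' '$1 ~ /[a-zA-Z]/ && $2 ~ /[a-zA-Z]/')

printf '%s\n' "$kept"
```

Only the first two records survive. If a record should instead be kept when at least one of the two fields has a letter, `&&` would become `||` inside the awk program; on these four records both readings give the same result.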

colucix · 06-07-2010, 01:06 AM

Actually I used a single pipe as delimiter and $3 to match the second field ($2 was the null string between the first two pipes).
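To make the empty field visible, this sketch splits a shortened sample record on a single pipe and prints every field in brackets.

```shell
# Split on a single pipe and list each field with its number.
fields=$(printf 'gaurav||gandhi||123\n' |
  awk -F'|' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

The output lists five fields: `$2` and `$4` are the empty strings between each pair of pipes, so the last name ends up in `$3`.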

syg00 · 06-07-2010, 01:23 AM

My comment was directed at @grail's post, not yours @colucix.
I'll be more specific in future ...

colucix · 06-07-2010, 02:28 AM

Mine too.

For the sake of the OP, if he ever pops up again: the field separator in awk can be either a single character or a regular expression. A separator of two or more characters is interpreted as a regular expression, and since | is the alternation operator there, "||" does not stand for two literal pipes (what it actually matches depends on the awk implementation).

In the second example posted by grail, each pipe sits inside a character list [...], where it is taken literally, so the regular expression really does match two consecutive pipes as the field separator.

Cheers!

grail · 06-07-2010, 03:11 AM

yes ... yes ... shoot me down .. lol

@colucix - thanks for the explanation

syg00 · 06-07-2010, 04:34 AM

o.k., let's continue the education (mine).
Why is "[|][|]" considered regex (in this context) but [||] isn't - [||]+ works. (remember I'm still coming to terms with awk).

colucix · 06-07-2010, 09:58 AM

Quote:

Originally Posted by syg00

o.k., let's continue the education (mine).
Why is "[|][|]" considered regex (in this context) but [||] isn't - [||]+ works. (remember I'm still coming to terms with awk).

Actually both are considered regexps, but [||] is a character list that means "match a single character, be it either | or |" (a needless redundancy). Instead, [||]+ (which is the same as [|]+) matches one or more occurrences of the character, as in extended regular expressions. grail's solution

Code:

[|][|]

matches exactly two consecutive characters, each one taken from a character list.
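The difference between the two separators shows up on the second sample record from the first post, which has a lone pipe inside `89|90`. A short sketch comparing the field counts:

```shell
# Second sample record from the original post (note the single pipe in 89|90).
line='#a%bcd||123abc||89|90||91'

# Exactly two pipes: the lone pipe stays inside the third field.
exact=$(printf '%s\n' "$line" | awk -F'[|][|]' '{ print NF ": " $3 }')

# One or more pipes: the lone pipe also splits the record.
greedy=$(printf '%s\n' "$line" | awk -F'[|]+' '{ print NF ": " $3 }')

printf 'exact:  %s\n' "$exact"
printf 'greedy: %s\n' "$greedy"
```

With `[|][|]` the record has 4 fields and the third is `89|90`; with `[|]+` it has 5 fields and the third is `89`.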

The same applies if you use something like

Code:

[|&;][|&;]

that matches any of these combinations:

Code:

||   |&   |;   &&   &|   &;   ;;   ;|   ;&
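A minimal illustration of such a two-character separator. The input line is made up for the example and mixes three of the listed combinations.

```shell
# Hypothetical line using |&, ;; and &| as separators.
line='a|&b;;c&|d'

parts=$(printf '%s\n' "$line" |
  awk -F'[|&;][|&;]' '{ print NF ": " $1, $2, $3, $4 }')

printf '%s\n' "$parts"
```

Each pair of separator characters is consumed as one delimiter, so the line splits into the four fields a, b, c and d.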

gandhigaurav1986 · 06-07-2010, 10:30 PM

Thanks a lot guys.... my problem is solved now

grail · 06-08-2010, 02:08 AM

Quote:

my problem is solved now

Don't forget to mark as SOLVED then