[SOLVED] Deleting n number of consecutive occurrences of a pattern

Thirumala! · 11-18-2015, 04:42 AM

Hello All,

I want to delete a particular number of consecutive occurrences of a pattern from the file using awk. Please help me with the same.

Example of the file contents

0000
0010
0011
0000
0000
0000
0000
0000
1111
1111
0010
0000

Now I want to delete only the block where 0000 is repeated 5 times consecutively and leave the other 0000 lines unchanged. How can I do this using awk?

Thanks in advance,
Thirumala

syg00 · 11-18-2015, 05:07 AM

So that's what you want; what have you attempted?
You make the effort, we'll help when you run into trouble.

Thirumala! · 11-18-2015, 05:13 AM

Hey syg00,

I have tried the below command

cat temp | awk 'N&&sub(PAT,REPL){N--};1' N=291 PAT="0000" REPL="" > temp1
cat temp1 | sed '/^$/d' > temp2

This command deletes the first 291 occurrences, but I want to delete only a block of 291 consecutive occurrences.

Thanks,
Thirumala

berndbausch · 11-18-2015, 05:55 AM

Try this:

When the input line matches the pattern, remember the line in an array and count down. If the counter reaches 0, throw the array away and set the counter back to N.
When it doesn't match the pattern:
- if the array isn't empty, fewer than N patterns were in a row, so write the array out. Clear the array. Set the counter back to N.
- print the current line

I wonder if it can be done with other commands.

grail · 11-18-2015, 06:05 AM

Another thing to consider would be, what if there are more than 5 in a row? Do you delete if it is 6? Or only if another 5, ie. 10?

Thirumala! · 11-18-2015, 06:09 AM

It should not replace if occurrences are more than n. And it should replace only if next set of occurrences are n again.

syg00 · 11-18-2015, 06:32 AM

Nope, we are not going to write it for you.
You have been given some hints - incorporate them in your code. The countdown is a good idea, use it to also test if the current record is equal to the previous.

berndbausch · 11-18-2015, 07:05 PM

Quote:

Originally Posted by berndbausch

Try this:

When the input line matches the pattern, remember the line in an array and count down. If the counter reaches 0, throw the array away and set the counter back to N.
When it doesn't match the pattern:
- if the array isn't empty, fewer than N patterns were in a row, so write the array out. Clear the array. Set the counter back to N.
- print the current line

Sorry I couldn't resist the itch and ended up writing it. Why not share it then:

Code:

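#!/usr/bin/awk -f

# Drop each block of N consecutive lines equal to PAT; print everything else.
BEGIN   { N=5; PAT="0000"; ix=0 }
$0==PAT { saved[ix] = $0; ix++;              # hold the matching line
          N--                                # count down the run
          if (N==0) { delete saved; N=5 }    # N in a row: discard them
          next                               }

        { for (i in saved) print saved[i]    # run was too short: emit held lines
          delete saved
          N=5
          print                           }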

Adding this condition is left as an exercise:

Quote:

Originally Posted by Thirumala!

It should not replace if occurrences are more than n. And it should replace only if next set of occurrences are n again.

By the way, I now notice that I forgot to reset the index variable ix. Thanks to the associative nature of awk arrays, this doesn't seem to be a problem.

grail · 11-19-2015, 12:05 AM

@berndbausch - just remember that now this user may expect to be told answers without doing any work in the future too

But, as you have let the cat out of the bag, here are 2 points of interest:

1. What happens if the last 3 entries in the file are the pattern?

2. If you rethink your use of N, you could reduce it to only being needed once outside the definition

(hint: consider ix values)

Thirumala! · 11-19-2015, 02:07 AM

Hello All,

Thanks for the help. This is the first time I am using awk, so I needed more help.

And rest assured that I will not expect any ready-made answers from you guys.

Thanks,
Thirumala

berndbausch · 11-19-2015, 03:20 AM

Quote:

Originally Posted by grail

@berndbausch - just remember that now this user may expect to be told answers without doing any work in the future too

But, as you have let the cat out of the bag, here are 2 points of interest:

1. What happens if the last 3 entries in the file are the pattern?

2. If you rethink your use of N, you could reduce it to only being needed once outside the definition

(hint: consider ix values)

Polishing is an exercise for the reader, and if somebody has wrong expectations, they can be reset quickly.
Well, when I have a little more time I may do the polishing just to prove my value.

berndbausch · 11-19-2015, 03:23 AM

Quote:

Originally Posted by grail

@berndbausch - just remember that now this user may expect to be told answers without doing any work in the future too

But, as you have let the cat out of the bag, here are 2 points of interest:

1. What happens if the last 3 entries in the file are the pattern?

2. If you rethink your use of N, you could reduce it to only being needed once outside the definition

(hint: consider ix values)

Well an END clause can take care of #1, and my brain is full so no rethinking #2 for now.
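
For readers following along, here is a minimal sketch of what that END clause could look like. It is not the poster's final code: the wrapper name flush_tail and the helper emit_held are made up for illustration, N and PAT are passed in with -v instead of being set in BEGIN, and the run length is tracked with ix alone, which is one possible reading of grail's hint #2.

```shell
#!/bin/sh
# flush_tail N PAT: hypothetical wrapper; reads lines on stdin.
flush_tail() {
    awk -v N="$1" -v PAT="$2" '
        function emit_held(   i) {           # print held lines in index order, then reset
            for (i = 0; i < ix; i++) print saved[i]
            ix = 0
        }
        $0 == PAT { saved[ix++] = $0         # hold the matching line
                    if (ix == N) ix = 0      # N in a row: drop the held block
                    next }
                  { emit_held(); print }     # run broken: emit held lines, then this line
        END       { emit_held() }            # input ended inside a short run
    '
}

# The file ends with three pattern lines; the END block keeps them.
printf '%s\n' 1111 0000 0000 0000 | flush_tail 5 0000
# prints 1111, then 0000 three times
```

Without the END block, the three trailing 0000 lines would still be sitting in the array when awk exits and would never be printed.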

syg00 · 11-19-2015, 03:34 AM

Quote:

Originally Posted by berndbausch

Sorry I couldn't resist the itch and ended up writing it. Why not share it then:

Quote:

Thanks to the associative nature of awk arrays, this doesn't seem to be a problem.

They have lots of unexpected behaviours - one of the most notable being that they don't guarantee order.
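
A small sketch of the usual workaround, assuming the array was filled with consecutive numeric indices: walk it with a counted loop instead of for (i in array), whose traversal order awk leaves unspecified. The function name ordered_demo and the sample words are illustrative only.

```shell
#!/bin/sh
# ordered_demo: hypothetical name; shows a counted loop over a numerically indexed array.
ordered_demo() {
    awk 'BEGIN {
        n = split("first second third fourth", a, " ")   # fills a[1]..a[n]
        for (i = 1; i <= n; i++) print i, a[i]            # counted loop: always 1..n
    }'
}

ordered_demo
# prints "1 first", "2 second", "3 third", "4 fourth", one per line
```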

MadeInGermany · 11-19-2015, 07:00 AM

Not all awk versions traverse a

Code:

for (i in array)

loop in index order.
Because the order must be kept, we can store the lines in a string instead

Code:

awk '
{ buf=buf sep $0; sep=RS }  # add sep and $0 to buf; undefined variables are "" in string context; RS is newline
$0!="0000" { print buf; f=0; buf=sep=""; next }  # print and clear buffer; "next" skips the following code
++f==5 { f=0; buf=sep="" }  # if 5 found then clear buffer; an undefined variable is 0 in number context
END {if (f>0) print buf}  # print a remaining buffer
' temp

grail · 11-19-2015, 07:41 AM

I think some of you might be getting a little too carried away with the order stuff. Remember what is being stored in the array: it is only ever the same pattern (0000), so order here is pretty irrelevant.
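
Taking that observation one step further, a counter can stand in for the array entirely. The sketch below also attempts the condition left as an exercise earlier in the thread, under one assumed reading of it: a run is deleted only when it is exactly N lines long, and shorter or longer runs are kept whole. The names delete_exact_runs and end_run are hypothetical.

```shell
#!/bin/sh
# delete_exact_runs N PAT: hypothetical wrapper; reads lines on stdin.
delete_exact_runs() {
    awk -v N="$1" -v PAT="$2" '
        function end_run(   i) {             # called whenever a run of PAT has ended
            if (run != N)                    # keep runs shorter or longer than N
                for (i = 0; i < run; i++) print PAT
            run = 0
        }
        $0 == PAT { run++; next }            # only count: every held line equals PAT
                  { end_run(); print }       # settle the pending run, then print this line
        END       { end_run() }              # settle a run that reaches end of input
    '
}

# A run of two is removed; a run of three is kept.
printf '%s\n' a 0000 0000 b 0000 0000 0000 c | delete_exact_runs 2 0000
# prints a, b, 0000, 0000, 0000, c (one per line)
```

Because the decision is made only when the run ends, nothing is printed or dropped until the next non-matching line or the end of input. If ten in a row should count as two blocks of five and be deleted, the test would instead be run % N.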