[SOLVED] how can I ignore or remove lines with 2 or more identical numbers in the same line?

L4Z3R · 01-18-2018, 06:27 AM

Hi!

I need help again. As always, I googled before asking here, but the results I got were about something else.

My question is: how can I ignore or remove lines that contain two or more identical numbers?

For example, here is a sample list of numbers:

03 02 01
01 02 01
01 05 07

How can I ignore or remove line 2, 01 02 01, which contains 01 twice?

In other words, I want each line to have unique numbers. I tried this so far:

Code:

cat nums | tr -s '[0-9][0-9]'
03 02 01
01 02 01
01 05 07

Code:

cat nums | tr -s '01'
03 02 01
01 02 01
01 05 07

Neither one works.

I appreciate any help, suggestions and ideas. Thanks

pan64 · 01-18-2018, 06:34 AM

This is called grouping and backreference. You need to create a group (this is what you are looking for) and use a backreference to specify a repetition of the same string.

Code:

([0-9][0-9]).*\1

or similar, syntax depends on the tool you use.
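
A minimal sketch of the idea, assuming Python's `re` module as the tool (the name `has_repeated_pair` is made up for illustration). The parentheses capture a two-digit string, and `\1` requires that same text to appear again later on the line:

```python
import re

# Group 1 captures any two digits; \1 must match the identical text later on.
REPEATED_PAIR = re.compile(r"([0-9][0-9]).*\1")

def has_repeated_pair(line: str) -> bool:
    """Return True if some two-digit string occurs twice in the line."""
    return REPEATED_PAIR.search(line) is not None

sample = ["03 02 01", "01 02 01", "01 05 07"]
kept = [line for line in sample if not has_repeated_pair(line)]
print(kept)  # ['03 02 01', '01 05 07']
```

This only covers the two-digit case from the original question; numbers of other lengths need more care.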

syg00 · 01-18-2018, 06:47 AM

sed is probably the easiest way to delete lines based on content. It accepts the regex constructs you have already been directed to in prior threads.
Read the doco.

L4Z3R · 01-18-2018, 07:21 AM

Quote:

Originally Posted by pan64

This is called grouping and backreference. You need to create a group (this is what you are looking for) and use a backreference to specify a repetition of the same string.

Code:

([0-9][0-9]).*\1

or similar, syntax depends on the tool you use.

Code:

egrep -v "([0-9][0-9]).*\1" nums 
03 02 01
01 05 07

It works like a charm! Kudos to you, pan64!

+1

Quote:

Originally Posted by syg00

sed is probably the easiest way to delete lines based on content. It accepts the regex constructs you have already been directed to in prior threads.
Read the doco.

The man pages for some commands are easy to decipher, but those for sed, awk and grep can be confusing, especially when dealing with regex. I know only the very basics of these commands. Sometimes it's hard to know when to use grouping and how to group properly. I need to study regex as much as possible.

+1

pan64 · 01-18-2018, 08:26 AM

You can probably test it here: https://www.regextester.com/?fam=100025 (if the link works)

danielbmartin · 01-18-2018, 10:40 AM

Quote:

Originally Posted by pan64

Code:

([0-9][0-9]).*\1

OP asked:

Code:

...how can I ignore or remove lines with 2 or more identical numbers
in the same line.

His example InFile contained two-digit numbers and your solution produced a correct OutFile for this limited case. I tried to extend your solution to numbers of various lengths and was not successful. Please teach us how this is done. You might like to use this sample InFile...

Code:

03 02 01             (keep)
01 02 01             (toss)
01 05 07             (keep)
04 06 06             (toss)
1234 22 56789 33     (keep)
1234 22 1234 33      (toss)
1234 22 123 33       (keep)

Daniel B. Martin


Sefyir · 01-18-2018, 12:11 PM

Is there a clear delimitation between each number?

You can try this with python3.

Split the line into a list of elements, removing extra characters such as the newline: 'a a b c' becomes ['a', 'a', 'b', 'c'].
Put a copy of the list into a set: ['a', 'a', 'b', 'c'] becomes {'b', 'a', 'c'} (sets are unordered and contain only unique values).
Compare the number of elements in the list (['a', 'a', 'b', 'c'] = 4) and in the set ({'b', 'a', 'c'} = 3). If they are equal, print the original line, since no duplicates were detected.

Code:

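#!/usr/bin/env python3
import fileinput

dlm = None  # None splits on any run of whitespace; set e.g. ',' for other delimiters
for line in fileinput.input():
    # strip() drops the trailing newline before the line is split into fields
    dlm_line = line.strip().split(dlm)
    # A set keeps only unique values, so equal sizes mean no duplicates
    if len(set(dlm_line)) == len(dlm_line):
        print(line, end='')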

Code:

$ cat numbers
03 02 01
01 02 01
01 05 07
04 06 06
1234 22 56789 33
1234 22 1234 33
1234 22 123 33
$ ./duplicates < numbers # Or ./duplicates numbers
03 02 01
01 05 07
1234 22 56789 33
1234 22 123 33

Turbocapitalist · 01-18-2018, 12:12 PM

For numbers of variable length, you need to add word boundaries before and after the group as part of the pattern. The notation differs between the styles of regular expression:

Code:

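# \1 unanchored: the repeat can also match inside a longer number (e.g. 123 within 1234)
grep -v -E '\<([0-9]+)\>.*\1' numbers.txt
# \1 anchored: the repeat must be a whole number
grep -v -E '\<([0-9]+)\>.*\<\1\>' numbers.txt

# The same pair in PCRE syntax (-P), where \b marks a word boundary
grep -v -P '\b([0-9]+)\b.*\1' numbers.txt
grep -v -P '\b([0-9]+)\b.*\b\1\b' numbers.txt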

In some implementations, such as BSD regex, the word boundaries are written [[:<:]] and [[:>:]].

Edit: wrapped \1 in word boundaries as per reminder by pan64 below.
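
A small sketch of why the `\1` needs its own word boundaries, assuming Python's `re` module (which uses `\b` like the `-P` variants above). The test line is hypothetical and contains no repeated number:

```python
import re

# \1 left bare: the repeat may match inside a longer number.
HALF_BOUNDED = re.compile(r"\b([0-9]+)\b.*\1")
# \1 wrapped in \b: the repeat must be a whole number on its own.
FULLY_BOUNDED = re.compile(r"\b([0-9]+)\b.*\b\1\b")

line = "123 22 1234 33"  # hypothetical line: no number actually repeats

print(bool(HALF_BOUNDED.search(line)))   # True: 123 is found inside 1234
print(bool(FULLY_BOUNDED.search(line)))  # False: no whole number repeats
```

With `grep -v`, the first pattern would wrongly drop this line, while the second keeps it.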

danielbmartin · 01-18-2018, 01:22 PM

Using the method of Sefyir (post #7) and using this InFile ...

Code:

03 02 01               (keep)
01 02 01               (toss)
01 05 07               (keep)
04 06 06               (toss)
1234 22 56789 33       (keep)
1234 22 1234 33        (toss)
1234 22 123 33         (keep)
77 1234 22 1234 22 99  (toss)

... this awk ...

Code:

# a[] counts each field; a line is printed only if every field is distinct
awk '{delete a; for (j=1;j<=NF;j++) a[$j]++;
   if (length(a)==NF) print}' "$InFile" >"$OutFile"

... produced this OutFile ...

Code:

03 02 01               (keep)
01 05 07               (keep)
1234 22 56789 33       (keep)
1234 22 123 33         (keep)

Daniel B. Martin


danielbmartin · 01-18-2018, 01:46 PM

Making a fancier result ...

With this InFile ...

Code:

03 02 01               (keep)
01 02 01               (toss)
01 05 07               (keep)
04 06 06               (toss)
1234 22 56789 33       (keep)
1234 22 1234 33        (toss)
1234 22 123 33         (keep)
77 1234 22 1234 22 99  (toss)

... this awk ...

Code:

awk '{delete a; dupes="";
      for (j=1;j<=NF;j++) if (++a[$j]>1) dupes=dupes $j" "
      if (dupes) print $0"  FAILED; repeats were "dupes
      else print}' "$InFile" >"$OutFile"

... produced this OutFile ...

Code:

03 02 01               (keep)
01 02 01               (toss)  FAILED; repeats were 01 
01 05 07               (keep)
04 06 06               (toss)  FAILED; repeats were 06 
1234 22 56789 33       (keep)
1234 22 1234 33        (toss)  FAILED; repeats were 1234 
1234 22 123 33         (keep)
77 1234 22 1234 22 99  (toss)  FAILED; repeats were 1234 22
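
For comparison, a rough Python counterpart of the reporting idea, assuming whitespace-separated fields (the function name `annotate` is made up for illustration). Unlike the awk above, it lists each repeated field once even if it appears three or more times:

```python
from collections import Counter

def annotate(line: str) -> str:
    """Return the line unchanged, or with a note listing fields seen more than once."""
    counts = Counter(line.split())
    # Counter keeps first-seen order, so repeats are listed in input order.
    repeats = [field for field, n in counts.items() if n > 1]
    if repeats:
        return f"{line}  FAILED; repeats were {' '.join(repeats)}"
    return line

for line in ["03 02 01", "01 02 01", "77 1234 22 1234 22 99"]:
    print(annotate(line))
```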

Daniel B. Martin


pan64 · 01-19-2018, 02:11 AM

Quote:

Originally Posted by danielbmartin

I tried to extend your solution to numbers of various lengths and was not successful.

Code:

03 02 01
01 02 01
01 05 07
04 06 06
1234 22 56789 33
1234 22 1234 33
1234 22 123 33
123 456 213 123 678
123 456 786 12345 67
1 5 765346 3

At first, you can simply use +:

Code:

([0-9]+).*\1

But we also need to specify a delimiter (to avoid matching 234 inside 123456), so you need zero-length boundaries: http://perldoc.perl.org/perlrebacksl...%7b%7d%2c-%5cB
It is not trivial (it looks like a zero-length pattern cannot be backreferenced), so:

Code:

\b([0-9]+)\b.*\b\1\b

works.
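
A quick check of both patterns against the sample list above, sketched with Python's `re` module (an assumption; the helper name `keep` is made up for illustration). It shows how much the boundaries matter:

```python
import re

UNBOUNDED = re.compile(r"([0-9]+).*\1")
BOUNDED = re.compile(r"\b([0-9]+)\b.*\b\1\b")

sample = [
    "03 02 01", "01 02 01", "01 05 07", "04 06 06",
    "1234 22 56789 33", "1234 22 1234 33", "1234 22 123 33",
    "123 456 213 123 678", "123 456 786 12345 67", "1 5 765346 3",
]

def keep(lines, pattern):
    """Return the lines that the pattern does not match (like grep -v)."""
    return [line for line in lines if not pattern.search(line)]

print(len(keep(sample, UNBOUNDED)))  # 0
for line in keep(sample, BOUNDED):
    print(line)
```

Without boundaries, the group can shrink to a single digit, so any repeated digit anywhere on the line counts as a match and every sample line is rejected. With boundaries, six lines are kept.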

L4Z3R · 01-20-2018, 01:15 AM

Thanks to everyone here for the new code you provided. I am slowly learning this complex regex stuff.

BTW, which is easier to learn, perl or python?

+1 rep to all

Turbocapitalist · 01-20-2018, 03:21 AM

I think the answer to that question depends on you. But I'll ramble since you ask. I myself find perl much, much easier and quite fun, but part of that is that there are some key characteristics of python that I do not like at all, and I'm not able to get past that distaste. That said, there was also a big push for a long time to disparage perl. I think it was backed by M$ in an attempt to push one of their failures, but instead most people just pivoted to python and (ugh) PHP. Perl has much more flexible syntax, a proven, mature catalog of modules, and more powerful regular expressions. However, most regex needs can still be met by python. In favor of python is that it has been adopted by a great many successful training programmes and initiatives as a training language. The flip side of that is that it strikes me as a training language and may end up haunting us 20 years from now in bad ways, like BASIC once did. Python enjoys a certain trendiness at the moment. I also suspect, but don't fully have the skill to assess, that perl has been put together better from a CS standpoint.

syg00 · 01-20-2018, 03:29 AM

One assumes all those comments pertain to perl 5. Only.
The schism in perl is no more attractive than that in python. The user has been the victim of the developers once again.

I keep trying to get into python, but it just hasn't happened.

Turbocapitalist · 01-20-2018, 03:33 AM

Quote:

Originally Posted by syg00

One assumes all those comments pertain to perl 5. Only.

Yes. Perl 6 is a totally different language despite the name and the development team. I have not gotten around to looking carefully at Perl 6; it might be good, it might not be. However, it is not ubiquitous like Perl 5 is, and has been for decades.