[SOLVED] Get strings distributed along up to 3 lines

Perseus · 08-27-2013, 01:28 PM

Hello to all in forum,

I'd appreciate some help.

I don't know whether this is a job for awk, sed, perl, etc.

Given the following text, I want to extract 2 patterns and print related patterns on the same line:

Code:

pattern1: bc[0-9]d
pattern2: jk[0-9]lmnopqrs

How do we know whether they are related? Pattern1 always occurs, but pattern2 does not. If pattern1 is found and
the next pattern found is pattern2, they are related and should be printed on the same line. If 2 consecutive
pattern1 matches are found (across 2 or 3 lines, or on the same line), the earlier pattern1 has no pattern2.

Input:

Code:

abc1defghi
jk3lmnopqr
stuvwxyzza
bc4defghij
klmnopuqrs
tuvwxxyzab
c8defghijk
4lmnopqrst
uvwxyzwwww

Output desired:

Code:

bc1d jk3lmnopqrs
bc4d
bc8d jk4lmnopqrs

I don't know whether this is possible with awk, because awk
reads line by line and, as you can see, a pattern can begin on one line
and end on the next one, or even begin on one line and end 2 lines below.
The goal is to learn how to do it for this sample file and then extend it to
a big file.

Thanks in advance for any help.

konsolebox · 08-27-2013, 03:11 PM

If your text is not divided by newlines you could use grep:

Code:

grep -o -e 'bc[0-9]d' -e 'jk[0-9]lmnopqrs' file

Output:

Code:

bc1d
jk3lmnopqrs
bc4d
bc8d
jk4lmnopqrs

With that output it should now be easy to select which are valid.
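One way to do that selection, as a minimal sketch: feed the one-match-per-line output to awk, start a new output line at every pattern1 match, and append a pattern2 match only if it directly follows a pattern1. The function name `pair_matches` is made up for illustration, and the sketch assumes its input contains only lines produced by the grep command above.

```shell
# pair_matches: hypothetical helper; reads one match per line (grep -o output).
pair_matches() {
    awk '
        /^bc[0-9]d$/ {                          # pattern1 starts a new output line
            if (have) print line
            line = $0; have = 1
            next
        }
        have == 1 { line = line " " $0; have = 2 }   # first pattern2 after a pattern1
        END { if (have) print line }            # flush the last pending line
    '
}

# The five matches from the grep output above, one per line.
printf '%s\n' bc1d jk3lmnopqrs bc4d bc8d jk4lmnopqrs | pair_matches
```

For the five matches shown above this prints `bc1d jk3lmnopqrs`, `bc4d` and `bc8d jk4lmnopqrs`, each on its own line.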

Perseus · 08-27-2013, 03:20 PM

Hello konsolebox,

Thanks for the answer.

The file doesn't have blank lines, but it has newline characters like any standard file.

I tried it in Cygwin, but I only get this result:

Code:

$ grep -o -e 'bc[0-9]d' -e 'jk[0-9]lmnopqrs' file
bc1d
bc4d

Thanks in advance for any help.

konsolebox · 08-27-2013, 03:35 PM

You can use this C code to convert your file:

Code:

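#include <unistd.h>

#define BUFFER_SIZE 2000
char buffer[BUFFER_SIZE];

/* Copy stdin to stdout, dropping every newline character. */
int main (void) {
    ssize_t count;
    /* read() returns 0 at end of input and -1 on error; stop on both. */
    while ((count = read(0, buffer, BUFFER_SIZE)) > 0) {
        ssize_t i, j;
        /* j marks the start of the current newline-free run. */
        for (i = 0, j = 0; i < count; ++i) {
            if (buffer[i] == '\n') {
                if (i > j) {
                    write(1, buffer + j, i - j);
                }
                j = i + 1;
            }
        }
        /* Flush whatever follows the last newline in this chunk. */
        if (i > j) {
            write(1, buffer + j, i - j);
        }
    }
    return 0;
}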

Compile it and do:

Code:

./output_binary < file | grep -o -e 'bc[0-9]d' -e 'jk[0-9]lmnopqrs'
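If compiling is inconvenient, `tr -d '\n'` should do the same newline removal as the program above. A sketch using the sample input from the first post (the file name `sample.txt` is made up here):

```shell
# Recreate the sample input from the first post (file name is an assumption).
cat > sample.txt <<'EOF'
abc1defghi
jk3lmnopqr
stuvwxyzza
bc4defghij
klmnopuqrs
tuvwxxyzab
c8defghijk
4lmnopqrst
uvwxyzwwww
EOF

# Delete every newline, then let grep print each match on its own line.
tr -d '\n' < sample.txt | grep -o -e 'bc[0-9]d' -e 'jk[0-9]lmnopqrs'
```

If the file has CRLF line endings, `tr -d '\r\n'` would remove the carriage returns as well.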

danielbmartin · 08-27-2013, 03:57 PM

Quote:

Originally Posted by Perseus

The file doesn't have blank lines, but it has newlines characters as any standard file.

You may use the excellent grep provided by konsolebox this way ...

Code:

 paste -s -d"\0" <$InFile2                   \
|grep -o -e 'bc[0-9]d' -e 'jk[0-9]lmnopqrs'  \
|paste -s -d" "                              \
|sed 's/\(bc[0-9]d\)/\n\1/g'                 \
>$OutFile

... to produce this ...

Code:

bc1d jk3lmnopqrs 
bc4d 
bc8d jk4lmnopqrs

Daniel B. Martin

Perseus · 08-27-2013, 04:19 PM

Hello konsolebox and Daniel,

I'll try your code ASAP. The original input file is a dump of a binary file made with the xxd command; it is a 4 GB file with 256 characters per line.

Do you think I could use the same code with this large file?
Or is there a way to apply the regexes for the patterns directly to the binary?

Thanks for the help again.

Perseus · 08-28-2013, 01:45 AM

Hello konsolebox and Daniel,

I have a problem extracting patterns when they are on the same line.

I want to extract the patterns c + number + some characters + k + number + 7 characters from this line:

Code:

abc1defghijk3lyyuopqtstuvwxyzzabc4defghijklmnopuqrstuvwxxyzabc8defghijk5lmnopqrstuvwxyzwwww

Instead of those 2 strings, I'm getting the long string below.

Code:

$ echo "abc1defghijk3lmnopqrstuvwxyzzabc4defghijklmnopuqrstuvwxxyzabc8defghijk5lmnopqrstuvwxyzwwww" | grep -o -e 'c[0-9].*k[0-9].\{7\}'
c1defghijk3lmnopqrstuvwxyzzabc4defghijklmnopuqrstuvwxxyzabc8defghijk5lmnopqr

How can I make grep extract those 2 strings separately?

Thanks in advance for your help.

pan64 · 08-28-2013, 01:59 AM

I would use the string bc as the record separator (instead of newline),
then remove all the newlines,
and finally print the matching records using a regexp like
^[0-9]d.*jk[0-9]lmnopqrs

You can use awk or perl to implement it.

Perseus · 08-28-2013, 02:11 AM

Hello Pan64,

Could you please show me how to do it in awk or perl?

As explained in the first post, I need 2 patterns. Pattern1 always occurs
and pattern2 does not, but both can span more than one or two lines
in an input of 128 bytes per line (dumped with xxd).

Thanks for any help.

konsolebox · 08-28-2013, 02:21 AM

@Perseus Have you tried my solution? How did it go? What needed to be changed?

pan64 · 08-28-2013, 03:26 AM

Something like this.
The \n? is there because a newline can occur almost anywhere:

Code:

awk 'BEGIN { RS="b\n?c"; }                  # regex record separator (gawk/mawk extension)
     ! /^\n?[0-9]\n?d/ { next }             # skip records not starting with digit + d
   { gsub("\n", "");                        # remove \n
     printf "bc%s", substr($0, 1, 2);       # substr is 1-based; a start of 0 can drop a character
     if ( match($0, "jk[0-9]lmnopqrs") ) 
         printf " %s", substr($0, RSTART, RLENGTH);
      print ""
   } ' input.txt

danielbmartin · 08-28-2013, 10:43 AM

Quote:

Originally Posted by Perseus

... I want to extract the patterns c+number+some characters + k+number+7 characters ...

Try this ...

Code:

awk -F "" 'BEGIN {RS="c"} 
  {k=index($0,"k");
   if (k>0 && NF>k+7 && $1~/^[0-9]$/ && $(k+1)~/^[0-9]$/)
     print RS substr($0,1,k+8)}' "$InFile" >"$OutFile"

Daniel B. Martin
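A grep-only variant of the same idea, as a sketch: replacing `.*` with `[^c]*` stops the match from running past the next `c`, so each match stays inside one `c...` segment. This assumes, like the awk version, that no `c` occurs between the starting `c` and the `k`.

```shell
# The one-line test string from the echo command in the post above.
line='abc1defghijk3lmnopqrstuvwxyzzabc4defghijklmnopuqrstuvwxxyzabc8defghijk5lmnopqrstuvwxyzwwww'

# c + digit + any run of non-c characters + k + digit + 7 more characters
printf '%s\n' "$line" | grep -o 'c[0-9][^c]*k[0-9].\{7\}'
```

On this string it should print `c1defghijk3lmnopqr` and `c8defghijk5lmnopqr`; the `c4...` segment is skipped because its `k` is not followed by a digit.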

Perseus · 08-28-2013, 11:58 PM

Hello to all

Many thanks for your help and your time.

I have tried everyone's code, but when I try to replicate it on a real file with grep or awk,
the regex does not seem to work for pattern-2. I want to extract these patterns:

pattern-1: ff77 + 6 to 18 characters + 532064 + 10 characters + 814 + 13 characters
pattern-2: 059 + 32 to 34 characters + some characters + 940e + 28 characters

For pattern-1 the regex I'm using works, but for pattern-2 it takes more characters than
I want.

Regex used for pattern-1: ff77.{6,18}532064.{10}814.{13} --> it works
Regex for pattern-2: 059.{32,34}.*940e.\{28\} --> it takes characters belonging to more than one pattern-2.

Pattern-2 is always followed by 9506.

The regex I have now for pattern-2 is taking all the characters in red.

Code:

93114444444c55535f529332939333303693303032353807ffffffffffffffff77000001532064022272619f81422060001fffff0015000a4800015a00074200
013300013600013700016600016500017700016900017900009300012200002100010900010a00012600010800012b00002c00002d00002e0000550000560007
2a00002f0000300000930000ff3400800932c90600000000a000800935c90600000000000080093cc90600000000800005910f01020000000d8147451907ffff
ff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff020102019506000000000000ff7700
0002532064014041612f81422060002fffff0015000a4800015a0007420001330001360001370001660001650001770001690001790000930001220000210001
0900010a00012600010800012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90600000000a000800935c906000000
00000080093cc90600000000800005910f01020000000d8147451925ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559f
ffff00940e01020102010001ffffff020102019506000000000000ff77000003532064022280546f81422060003fffff0015000a4800015a0007420001330001
3600013700016600016500017700016900017900009300012200002100010900010a00012600010800012b00002c00002d00002e00005500005600072a00002f
0000300000930000ff3400800932c90600000000a000800935c90600000000000080093cc90600000000800005910f01020000000d8147451905ffffff008930
010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff020102019506000000000000ff770000045320
64022939276f81422060004fffff0015000a4800015a00074200013300013600013700016600016500017700016900017900009300012200002100010900010a
00012600010800012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90600000000a000800935c90600000000000080
093cc90600000000800005910f01020000000d8147451944ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff0094
0e01020102010001ffffff020102019506000000000000ff77000005532064013741169f81422060354fffff0015000a4800015a000260000133000136000137
00017e00016900006a00007900009300012200002100010900010a00012600010200010400010500010600011000010800012b00002c00002d00002e00005500
005600072a00002f0000300000930000ff3400800932c90688888000a000800935c906000080000000800943c9068888800080000582002e0501000001006500
00000200000200180000000300000300170000000400000400010000000a00ffff0065000000ff77000006532064013741255f81422079900fffff0015000a48
00015a00026000013300013600013700017e00016900006a00007900009300012200002100010900010a00012600010200010400010500010600011000010800
012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90688888000a000800935c906000080000000800943c906888880

And the output desired for regex 2 is:

Code:

93114444444c55535f529332939333303693303032353807ffffffffffffffff77000001532064022272619f81422060001fffff0015000a4800015a00074200
013300013600013700016600016500017700016900017900009300012200002100010900010a00012600010800012b00002c00002d00002e0000550000560007
2a00002f0000300000930000ff3400800932c90600000000a000800935c90600000000000080093cc90600000000800005910f01020000000d8147451907ffff
ff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff020102019506000000000000ff7700
0002532064014041612f81422060002fffff0015000a4800015a0007420001330001360001370001660001650001770001690001790000930001220000210001
0900010a00012600010800012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90600000000a000800935c906000000
00000080093cc90600000000800005910f01020000000d8147451925ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559f
ffff00940e01020102010001ffffff020102019506000000000000ff77000003532064022280546f81422060003fffff0015000a4800015a0007420001330001
3600013700016600016500017700016900017900009300012200002100010900010a00012600010800012b00002c00002d00002e00005500005600072a00002f
0000300000930000ff3400800932c90600000000a000800935c90600000000000080093cc90600000000800005910f01020000000d8147451905ffffff008930
010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff020102019506000000000000ff770000045320
64022939276f81422060004fffff0015000a4800015a00074200013300013600013700016600016500017700016900017900009300012200002100010900010a
00012600010800012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90600000000a000800935c90600000000000080
093cc90600000000800005910f01020000000d8147451944ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff0094
0e01020102010001ffffff020102019506000000000000ff77000005532064013741169f81422060354fffff0015000a4800015a000260000133000136000137
00017e00016900006a00007900009300012200002100010900010a00012600010200010400010500010600011000010800012b00002c00002d00002e00005500
005600072a00002f0000300000930000ff3400800932c90688888000a000800935c906000080000000800943c9068888800080000582002e0501000001006500
00000200000200180000000300000300170000000400000400010000000a00ffff0065000000ff77000006532064013741255f81422079900fffff0015000a48
00015a00026000013300013600013700017e00016900006a00007900009300012200002100010900010a00012600010200010400010500010600011000010800
012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90688888000a000800935c906000080000000800943c906888880

Thanks in advance for any help.

pan64 · 08-29-2013, 02:04 AM

Yes, I think this is the greediness of the regexp. You need to set ff77 as the record separator to avoid such problems.
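A sketch of that idea, assuming an awk that accepts a multi-character `RS` (gawk and mawk do; strict POSIX awk uses only the first character). The newlines have to go first, because `ff77` itself can be split across two dump lines. Inside each record the match is located with `index()` rather than a regex, so it cannot run into the next block. The helper name `extract_pattern2` and the sample data are made up for illustration, and the sketch takes the first `059` and the first following `940e` in each block.

```shell
# extract_pattern2: hypothetical helper; expects newline-free hex text on stdin.
extract_pattern2() {
    awk 'BEGIN { RS = "ff77" }               # one record per ff77-delimited block
    {
        p = index($0, "059")                 # start of pattern-2, if present
        if (!p) next
        rest = substr($0, p + 35)            # skip "059" plus the 32 mandatory characters
        q = index(rest, "940e")              # first 940e after that point
        if (!q) next
        len = 35 + (q - 1) + 4 + 28          # through 940e and the 28 characters after it
        if (length($0) - p + 1 >= len)       # only print complete matches
            print substr($0, p, len)
    }'
}

# Made-up sample: two blocks, each holding one pattern-2 followed by 9506.
z32=$(printf '%032d' 0)
z28=$(printf '%028d' 0)
printf 'ff77aa059%sbb940e%s9506ff77cc059%sdd940e%s9506\n' "$z32" "$z28" "$z32" "$z28" |
    extract_pattern2
```

On the real dump this would be run as `tr -d '\n' < infile | extract_pattern2`. Since both commands stream, memory use should be bounded by the longest block rather than by the file size, which matters for a 4 GB input.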

firstfire · 08-29-2013, 02:06 AM

Hi.

If you use grep or perl, you can use the non-greedy quantifier `.*?', like this:

Code:

$ tr -d '\n' <infile | grep -Po '059.{32,34}.*?940e.{28}'
05910f01020000000d8147451907ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
05910f01020000000d8147451925ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
05910f01020000000d8147451905ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
05910f01020000000d8147451944ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201

The `-P' option tells grep to use Perl-compatible regular expressions.
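To see the difference on something small, here is a sketch with a made-up string containing two short `059...940e` runs. It assumes a grep built with `-P` support, as in the command above.

```shell
# Made-up test string with two 059...940e runs.
s='059AAAA940eBBBB059CCCC940eDDDD'

# Greedy: .* runs to the last 940e, so both runs are merged into one match.
greedy=$(printf '%s\n' "$s" | grep -Po '059.*940e')

# Non-greedy: .*? stops at the first 940e, giving one match per run.
lazy=$(printf '%s\n' "$s" | grep -Po '059.*?940e')

printf 'greedy:\n%s\nnon-greedy:\n%s\n' "$greedy" "$lazy"
```

The greedy form returns the single string `059AAAA940eBBBB059CCCC940e`, while the non-greedy form returns `059AAAA940e` and `059CCCC940e` separately.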