[SOLVED] Get strings distributed along up to 3 lines

Perseus · 09-01-2013, 05:11 PM

Hello grail,

Thanks for the explanation.

For this input:

Code:

93114444444c55535f529332939333303693303032353807ffffffffffffffff77000001532064022272619f81422060001fffff0015000a4800015a00074200
013300013600013700016600016500017700016900017900009300012200002100010900010a00012600010800012b00002c00002d00002e0000550000560007
2a00002f0000300000930000ff3400800932c90600000000a000800935c90600000000000080093cc90600000000800005910f01020000000d8147451807ffff
ff009310010c0000000d8147451805ffffff0101960f010c0000000d81474518559fffff00940e01020102010001ffffff020102019506000000007000ff7700
0002532064014041612f81422060002fffff0015000a4800015a0007420001330001360001370001660001650001770001690001790000930001220000210001
0900010a00012600010800012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90600000000a000800935c906000000
00000080093cc90600000000800005910f01020000000d8147451825ffffff009310010c0000000d8147451805ffffff0101960f010c0000000d81474518559f
ffff00940e01020102010001ffffff020102019506000000000000ff77000003532064022280546f81422060003fffff0015000a4800015a0007420001330001
3600013700016600016500017700016900017900009300012200002100010900010a00012600010800012b00002c00002d00002e00005500005600072a00002f
0000300000930000ff3400800932c90600000000a000800935c90600000000000080093cc90600000000800005910f01020000000d8147451905ffffff009310
010c0000000d8147451805ffffff0101960f010c0000000d81474518559fffff00940e01020102010001ffffff020102019506000000000000ff770000045320
64022939276f81422060004fffff0015000a4800015a00074200013300013600013700016600016500017700016900017900009300012200002100010900010a
00012600010800012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90600000000a000800935c90600000000000080
093cc90600000000800005910f01020000000d8147451844ffffff009310010c0000000d8147451805ffffff0101960f010c0000000d81474518559fffff0094
0e01020102010001ffffff020102019506000000000000ff77000005532064013741169f81422060354fffff0015000a4800015a000260000133000136000137
00017e00016900006a00007900009300012200002100010900010a00012600010200010400010500010600011000010800012b00002c00002d00002e00005500
005600072a00002f0000300000930000ff3400800932c90688888000a000800935c906000080000000800943c9068888800080000582002e0501000001006500
00000200000200180000000300000300170000000400000400010000000a00ffff0065000000ff77000006532064013741255f81422079900fffff0015000a48
00015a00026000013300013600013700017e00016900006a00007900009300012200002100010900010a00012600010200010400010500010600011000010800
012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90688888000a000800935c906000080000000800943c906888880

With the script I have so far:

Code:

ruby -ne 'BEGIN{$/="ff77"};
    $_.gsub!(/\n/,"");
    $_.split($/).each{
        |x| 
		if x =~ /(^.{6,18})(532064.{9}).(814(\d){1,13})/;
			printf("%d %s %s","0x"+$1,$2,$3)
			if x =~ /(05)(9[0-9])([1-9][0-9a-f]|0[e-f]0[1-9a].{26,28}0[0-1].*?)(940e)(.{28})/;
				str=$2+$3+$4+$5
				map={/91../ => " PROD_1 ",/92../ => " PROD_2 ",\
					/93../ => " PROD_3 ",/94../ => " PROD_F ",/96../ => " PROD_6 "}
				map.each {|k,v| str.sub!(k,v)}
				printf(" %s\n",str)
			else
				printf("\n")
			end
		end		
        }' file

I get this output:

Code:

1 532064022272619 81422060001  PROD_1 01020000000d8147451807ffffff00 PROD_3 010c0000000d8147451805ffffff0101 PROD_6 010c0000000d81474518559fffff00 PROD_F 01020102010001ffffff02010201
2 532064014041612 81422060002  PROD_1 01020000000d8147451825ffffff00 PROD_3 010c0000000d8147451805ffffff0101 PROD_6 010c0000000d81474518559fffff00 PROD_F 01020102010001ffffff02010201
3 532064022280546 81422060003  PROD_1 01020000000d8147451905ffffff00 PROD_3 010c0000000d8147451805ffffff0101 PROD_6 010c0000000d81474518559fffff00 PROD_F 01020102010001ffffff02010201
4 532064022939276 81422060004  PROD_1 01020000000d8147451844ffffff00 PROD_3 010c0000000d8147451805ffffff0101 PROD_6 010c0000000d81474518559fffff00 PROD_F 01020102010001ffffff02010201
5 532064013741169 81422060354
6 532064013741255 81422079900

My desired output separates the substrings that follow each "PROD_#" label with pipes, as shown below:

Code:

1 532064022272619 81422060001  PROD_1 01|02|00|00000d|8147451807ffffff|00 PROD_3 01|0c|00|00000d|8147451805ffffff|01|01 PROD_6 01|0c|00|00000d|81474518559fffff|00 PROD_F 01|02|01|02|01|00|01|ff|ff|ff|02|01|02|01
2 532064014041612 81422060002  PROD_1 01|02|00|00000d|8147451825ffffff|00 PROD_3 01|0c|00|00000d|8147451805ffffff|01|01 PROD_6 01|0c|00|00000d|81474518559fffff|00 PROD_F 01|02|01|02|01|00|01|ff|ff|ff|02|01|02|01
3 532064022280546 81422060003  PROD_1 01|02|00|00000d|8147451905ffffff|00 PROD_3 01|0c|00|00000d|8147451805ffffff|01|01 PROD_6 01|0c|00|00000d|81474518559fffff|00 PROD_F 01|02|01|02|01|00|01|ff|ff|ff|02|01|02|01
4 532064022939276 81422060004  PROD_1 01|02|00|00000d|8147451844ffffff|00 PROD_3 01|0c|00|00000d|8147451805ffffff|01|01 PROD_6 01|0c|00|00000d|81474518559fffff|00 PROD_F 01|02|01|02|01|00|01|ff|ff|ff|02|01|02|01
5 532064013741169 81422060354
6 532064013741255 81422079900

Finally, I want the pipe-separated values after each "PROD_#" label in decimal format and without the padding f's:

Code:

1 532064022272619 81422060001  PROD_1 1|2|0|13|8147451807|0 PROD_3 1|12|0|13|8147451805|1|1 PROD_6 1|12|0|13|81474518559|0 PROD_F 1|2|1|2|1|0|1|255|255|255|2|1|2|1
2 532064014041612 81422060002  PROD_1 1|2|0|13|8147451825|0 PROD_3 1|12|0|13|8147451805|1|1 PROD_6 1|12|0|13|81474518559|0 PROD_F 1|2|1|2|1|0|1|255|255|255|2|1|2|1
3 532064022280546 81422060003  PROD_1 1|2|0|13|8147451905|0 PROD_3 1|12|0|13|8147451805|1|1 PROD_6 1|12|0|13|81474518559|0 PROD_F 1|2|1|2|1|0|1|255|255|255|2|1|2|1
4 532064022939276 81422060004  PROD_1 1|2|0|13|8147451844|0 PROD_3 1|12|0|13|8147451805|1|1 PROD_6 1|12|0|13|81474518559|0 PROD_F 1|2|1|2|1|0|1|255|255|255|2|1|2|1
5 532064013741169 81422060354
6 532064013741255 81422079900
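
For illustration, the last step (hex to decimal, dropping the padding f's) could be sketched in Ruby as below. This is only a sketch: the helper name to_decimal_fields is made up, and it assumes that a 16-character field is a digit string padded with f's while every other field is a plain hex number.

```ruby
# Hypothetical helper: convert one pipe-separated group of hex fields.
# Assumption: a 16-character field is a decimal digit string padded with "f";
# every other field is an ordinary hex number.
def to_decimal_fields(piped)
  piped.split("|").map { |field|
    if field.length == 16
      field.delete("f")        # keep the digits, drop the padding
    else
      field.to_i(16).to_s      # e.g. "0c" -> "12", "ff" -> "255"
    end
  }.join("|")
end

puts to_decimal_fields("01|02|00|00000d|8147451807ffffff|00")
puts to_decimal_fields("01|02|01|02|01|00|01|ff|ff|ff|02|01|02|01")
```

The length test is what keeps "ff" (255) from being treated as padding.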

I hope to be able to manipulate the substrings in order to get this output.

Thanks for all the help.

grail · 09-02-2013, 07:52 AM

I see some issues in the current process, namely that the data is not laid out consistently at all points. What I mean is, if we look at just the first line in the first 2 sets of output data:

Code:

1 532064022272619 81422060001  PROD_1 01020000000d8147451807ffffff00 PROD_3 010c0000000d8147451805ffffff0101 PROD_6 010c0000000d81474518559fffff00 PROD_F 01020102010001ffffff02010201

1 532064022272619 81422060001  PROD_1 01|02|00|00000d|8147451807ffffff|00 PROD_3 01|0c|00|00000d|8147451805ffffff|01|01 PROD_6 01|0c|00|00000d|81474518559fffff|00 PROD_F 01|02|01|02|01|00|01|ff|ff|ff|02|01|02|01

My specific concerns are with the 11-digit value 81474518559 in PROD_6 and the ff|ff|ff values in PROD_F:

1. The '9' at the end of that string means that for some reason we are now capturing 11 digits instead of 10 as per the previous sets ... so my question is, how do we know when it should be 10 or 11? (could it be more?)

2. The second shows that in all previous sets we are ignoring 'f' but now we are using it to return values. This one is not as much of a concern as my thought here would be to process everything prior to 940e and then process this one separately as it appears to have a different set of rules.

Currently my idea looks something like the code below, but I need more information to get a better picture:

Code:

#!/usr/bin/env ruby

BEGIN{  $/="ff77"   }   

File.open(ARGV[0])

while gets
    $_.gsub!(/\n/,"")
    
    $_.split($/).each{
        |x|

        next unless x =~ /^(.{6,18})(532064[^f]*).(814[^f]*)/

        printf("%d %s %s\n",$1,$2,$3)
    
        if x =~ /05(9.{32,34}.*?)(940e.{28})/

            $1.scan(/9.*?(?=9\d|$)/).each{
                |y|

                puts "|" + y + "|" 
                printf("PROD_%c ",y[1])
                s = y[4..-1].gsub(/f/,"")

                s =~ /(.{2})(.{2})(.{2})(.{6})(.{10})(.*)/
                puts "\t" + $1 + "|" + $2 + "|" + $3 + "|" + $4 + "|" + $5 + "|" + $6
            }
        else
            puts
        end
    }   
end

The above has errors and is not complete, but may give you an idea of where my questions above are going?

grail · 09-02-2013, 02:55 PM

Ok ... see what ya think of this (I have converted to a script instead of command line as it is now way too big):

Code:

#!/usr/bin/env ruby

BEGIN{	$/="ff77"	}

File.open(ARGV[0])

while gets
	$_.gsub!(/\n/,"")
	
	$_.split($/).each{
		|x|

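		# $1 = record counter (hex), $2 = 532064... field, $3 = 814... field; [^f]* stops at the f padding
		next unless x =~ /^(.{6,18})(532064[^f]*).(814[^f]*)/

		# .hex (as "0x"+$1 did in the one-liner): %d would read a zero-padded string such as "000010" as octal
		printf("%d %s %s",$1.hex,$2,$3)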
	
		if x =~ /05(9.{32,34}.*?)940e(.{28})/

			rest = $2

			$1.scan(/9.*?(?=9[1-9]|$)/).each{
				|y|

				printf(" PROD_%c",y[1])
				arr = y[4..-1].scan(/(.{2})(.{2})(.{2})(.{6})([^f]*)f*(.*)/)[0]
				
				arr.each_index{
					|i|

					if i > 4 && arr[i].length > 2
						arr << arr[i][2..-1]
						arr[i] = arr[i][0..1]
					end

					arr[i] = arr[i].to_i(16) if i != 4
				}
				printf(" %s",arr * "|")
			}
			printf " PROD_F "

			puts rest.scan(/../).map{|z| z.to_i(16)}.join("|")
		else
			puts
		end
	}
end

Running this on the current data I get:

Code:

$ ./Perseus.rb file
1 532064022272619 81422060001 PROD_1 1|2|0|13|8147451807|0 PROD_3 1|12|0|13|8147451805|1|1 PROD_6 1|12|0|13|81474518559|0 PROD_F 1|2|1|2|1|0|1|255|255|255|2|1|2|1
2 532064014041612 81422060002 PROD_1 1|2|0|13|8147451825|0 PROD_3 1|12|0|13|8147451805|1|1 PROD_6 1|12|0|13|81474518559|0 PROD_F 1|2|1|2|1|0|1|255|255|255|2|1|2|1
3 532064022280546 81422060003 PROD_1 1|2|0|13|8147451905|0 PROD_3 1|12|0|13|8147451805|1|1 PROD_6 1|12|0|13|81474518559|0 PROD_F 1|2|1|2|1|0|1|255|255|255|2|1|2|1
4 532064022939276 81422060004 PROD_1 1|2|0|13|8147451844|0 PROD_3 1|12|0|13|8147451805|1|1 PROD_6 1|12|0|13|81474518559|0 PROD_F 1|2|1|2|1|0|1|255|255|255|2|1|2|1
5 532064013741169 81422060354
6 532064013741255 81422079900

There are possibly ways to streamline it a bit more, but this seems to work with the current data.

Perseus · 09-03-2013, 12:49 AM

Hello grail,

I didn't answer sooner because, after seeing your code in post #32, I was racking my brain trying
to use the "scan" method you used, in order to print the values separated and in decimal format. I got as far as the test code below,
but the print step obviously still needs fixing. Then you posted the other code in post #33.

This is what I was trying before you sent your last code.

Code:

pat="05910f01020000000d8147451807ffffff009310010c0000000d8147451805ffffff0101960f010c0000000d81474518559fffff00940e01020102010001ffffff02010201"
 
pat.scan(/(9\d)(..)(..)(..)(..)(.{6})(\d{1,16})(f{1,16})([0-1]{2})([0-1]{0,2})/).each{
|y| 
for i in y
    print $1,"|",$2.hex,"|",$3.hex,"|",$4.hex,"|",$5.hex,"|",$6.hex,"|",$7,"|",$9.hex,"|",$10.hex                           
end
}

Quote:

Originally Posted by grail

1. The '9' at the end of that string means that for some reason we are now capturing 11 digits instead of 10 as per the previous sets ... so my question is, how do we know when it should be 10 or 11? (could it be more?)

Yes, the fields that begin with 532064.. and 814.. are 16 characters long: a variable number of digits followed by padding f's.

A typical string matched by regex2 is the following.

Code:

05910f01020000000d8147451807ffffff009310010c0000000d8147451805ffffff0101960f010c0000000d81474518559fffff00940e01020102010001ffffff02010201

So:
1- Each substring begins with 91, 93, 94 or 96 (there could be more, like 92, 95, 97, etc.).
2- The next 2 characters after 91, 93, 94, etc. are the length of the substring in bytes. So,
for 91 the next byte is "0f"=15, so after the "0f" there are 15 bytes (30 characters);
for 93 the next byte is "10"=16, so after the "10" there are 16 bytes (32 characters).
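
The tag / length / value layout described above can be walked without a regex. The following is only a sketch under that description: split_blocks is a made-up name, and the leading "05" of the sample string is assumed to have been stripped already.

```ruby
# Hypothetical helper: split a hex string into [tag, length, value] triples.
# Assumes each block is: 2-char tag, 2-char hex length (in bytes), then
# length * 2 hex characters of data.
def split_blocks(hex)
  blocks = []
  pos = 0
  while pos + 4 <= hex.length
    tag   = hex[pos, 2]
    len   = hex[pos + 2, 2].to_i(16)
    value = hex[pos + 4, len * 2]
    blocks << [tag, len, value]
    pos += 4 + len * 2          # jump straight to the next tag
  end
  blocks
end

sample = "910f01020000000d8147451807ffffff00" \
         "9310010c0000000d8147451805ffffff0101" \
         "960f010c0000000d81474518559fffff00" \
         "940e01020102010001ffffff02010201"

split_blocks(sample).each { |tag, len, value| puts "#{tag} #{len} #{value}" }
```

Because the position advances by the declared length, a "9x" pair inside the data cannot be mistaken for the start of a new block.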

Quote:

Originally Posted by grail

Ok ... see what ya think of this (I have converted to a script instead of command line as it is now way too big):

It seems to work just fine, but I don't understand the mysterious magic inside some lines of code. For example,
|y| holds the whole pattern matched by the regex, but I don't understand what "y[4..-1]" means.

One more issue is that the "PROD_X" strings are not in sequential order; PROD_X was just an example. They are
related like this:

Code:

for 91 --> APPLE
for 93 --> GRAPES
for 96 --> PEAR
for 94 --> ORANGE
.
.
Could be more values

Then, with those mapped values, the desired output changes a little bit:

Code:

1 532064022272619 81422060001 APPLE 1|2|0|13|8147451807|0 GRAPES 1|12|0|13|8147451805|1|1 PEAR 1|12|0|13|81474518559|0 ORANGE 1|2|1|2|1|0|1|255|255|255|2|1|2|1
2 532064014041612 81422060002 APPLE 1|2|0|13|8147451825|0 GRAPES 1|12|0|13|8147451805|1|1 PEAR 1|12|0|13|81474518559|0 ORANGE 1|2|1|2|1|0|1|255|255|255|2|1|2|1
3 532064022280546 81422060003 APPLE 1|2|0|13|8147451905|0 GRAPES 1|12|0|13|8147451805|1|1 PEAR 1|12|0|13|81474518559|0 ORANGE 1|2|1|2|1|0|1|255|255|255|2|1|2|1
4 532064022939276 81422060004 APPLE 1|2|0|13|8147451844|0 GRAPES 1|12|0|13|8147451805|1|1 PEAR 1|12|0|13|81474518559|0 ORANGE 1|2|1|2|1|0|1|255|255|255|2|1|2|1
5 532064013741169 81422060354
6 532064013741255 81422079900

Thanks for all your help

grail · 09-03-2013, 03:06 AM

Code:

y[4..-1]

This is a slice of the string stored in 'y', starting at the fifth character (index 4, as Ruby indexes are zero-based) and running to index -1. Negative indexes count back from the right, so -1 is the last character.
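
A quick way to see the slices is to try them on one block from the sample data. The variable names here are just for illustration:

```ruby
y = "910f01020000000d8147451807ffffff00"

tag    = y[0, 2]     # first two characters: "91"
digit  = y[1]        # single character "1" (a one-character String in Ruby 1.9+)
length = y[2, 2]     # "0f", the length byte
value  = y[4..-1]    # index 4 through the last character

puts value           # 01020000000d8147451807ffffff00
```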

As for not being sequential, this is not an issue as the regex always matches 940e last, hence ORANGEs are always last, although I guess they may appear earlier as well.

Perseus · 09-03-2013, 08:31 AM

Hello grail,

Thanks for the explanation. I understand better now.

Regarding the conversion of 91, 93, 96, etc., I was trying to change the printf(" PROD_%c",y[1] )
and put an "if" for each value instead, I mean:

Code:

if y[1]=91 then 
print "APPLES"
if y[1]=93 then 
print "GRAPES"
if y[1]=96 then 
print "PEAR"

But it is not working; it only prints the patterns of regex1.

What could be wrong?

Thanks again.

grail · 09-03-2013, 09:04 AM

That would be because y[1] is a single character, i.e. "1", "3" or "6". May I suggest you instead create a hash using those characters as keys:

Code:

fruit = { "1" => "APPLES", "3" => "GRAPES", "6" => "PEAR" }

This way the only change apart from adding the hash would be to change the following:

Code:

printf(" PROD_%c",y[1] )

# becomes

printf(" %s",fruit[y[1]] )
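
Keying the hash on the whole two-character tag is another option, sketched here. The names FRUIT and label_for are made up for the example, and the fallback label for unknown tags is an assumption, not something requested in the thread.

```ruby
# Tag-to-name table taken from the mapping listed earlier in the thread.
FRUIT = {
  "91" => "APPLE",
  "93" => "GRAPES",
  "96" => "PEAR",
  "94" => "ORANGE"
}.freeze

# Hypothetical helper: unknown tags fall back to a PROD_xx label
# instead of printing nothing.
def label_for(block)
  tag = block[0, 2]
  FRUIT.fetch(tag, "PROD_#{tag}")
end

puts label_for("910f01020000000d8147451807ffffff00")   # APPLE
puts label_for("9205")                                 # PROD_92
```

Hash#fetch with a default avoids the silent nil that fruit[y[1]] returns for a tag that is not in the table.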

Perseus · 09-03-2013, 02:16 PM

Hello grail,

I've added those 2 lines below and it seems to work.

Code:

$1.scan(/9.*?(?=9[1-9]|$)/).each{
	|y|
	fruit = { "1" => " APPLES", "3" => " GRAPES", "6" => " PEAR" }
	printf("%s",fruit[y[1]])

but if I change the regex from

Code:

$1.scan(/9.*?(?=9[1-9]|$)/).each{

to

Code:

$1.scan(/9([0-9])([1-9][0-9a-f]|0[e-f]0[1-9a]).{26,28}(0[0-1]){1,2}/).each{

it does not work. I think your regex matches the value after "9", and my regex is supposed to match
each string that begins with "9" within pattern 2, but I get an error using this regex.

Could you please explain how the regex you use, 9.*?(?=9[1-9]|$), works, so that
I understand how to modify it if I need to?

Thanks in advance

grail · 09-03-2013, 05:37 PM

Code:

9.*?(?=9[1-9]|$)

Broken down this says:

9 - a literal 9

.*? - non-greedy search of any characters after the 9

(?=9[1-9]|$) - this is a positive look ahead. This means that the data we want must be followed by a number from 91 to 99 or by the end of the string ($). The idea of this mechanism is that the look ahead is not saved as part of the data we want, but it must be present right after the data we are looking for
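
To see the breakdown in action, the look ahead can be run on its own against the sample data. The second string (tricky) is invented for the example, to show what happens when "9" followed by 1-9 turns up inside the data.

```ruby
blocks = "910f01020000000d8147451807ffffff00" \
         "9310010c0000000d8147451805ffffff0101" \
         "960f010c0000000d81474518559fffff00"

# Each match runs from a "9" up to (not including) the next "9[1-9]",
# or up to the end of the string.
parts = blocks.scan(/9.*?(?=9[1-9]|$)/)
parts.each { |part| puts part }

# Made-up value with "92" in the middle of the data: the block is cut in two.
tricky = "910f01020000000d8147459207ffffff00"
p tricky.scan(/9.*?(?=9[1-9]|$)/)
```

The "9f" inside the third block does not end a match, because "f" is not in [1-9].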

If you prefer to lay out the entire regex and save all the different portions (as in the regex you have shown), tell me what error you are getting and I will see if I can help correct it.
Also remember that the line below already does the break-up you are looking to accomplish here (as far as I can tell):

Code:

arr = y[4..-1].scan(/(.{2})(.{2})(.{2})(.{6})([^f]*)f*(.*)/)[0]
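
As a small check of what that line produces, it can be run on the value part of each block from the sample data (the variable names are just for the example):

```ruby
pattern = /(.{2})(.{2})(.{2})(.{6})([^f]*)f*(.*)/

apple  = "01020000000d8147451807ffffff00".scan(pattern)[0]
grapes = "010c0000000d8147451805ffffff0101".scan(pattern)[0]
pear   = "010c0000000d81474518559fffff00".scan(pattern)[0]

p apple    # ["01", "02", "00", "00000d", "8147451807", "00"]
p grapes   # ["01", "0c", "00", "00000d", "8147451805", "0101"]
p pear     # ["01", "0c", "00", "00000d", "81474518559", "00"]
```

([^f]*) takes 10 or 11 digits as needed because it simply stops at the first f, and the trailing "0101" of the second block is what the each_index loop in the script later splits into two bytes.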

Perseus · 09-03-2013, 06:33 PM

Hello grail, I hope you're fine!

When, instead of this regex,

Code:

#$1.scan(/9.*?(?=9[1-9]|$)/).each{

I use this other

Code:

$1.scan(/9([0-9])([1-9][0-9a-f]|0[e-f]0[1-9a]).{26,28}(0[0-1]){1,2}/).each{

I get the error shown below:

Code:

$ ruby script.rb file
1 532064022272619 81422060001extract1.rb:22:in `block (2 levels) in <main>': undefined method `scan' for nil:NilClass (NoMethodError)
        from extract1.rb:16:in `each'
        from extract1.rb:16:in `block in <main>'
        from extract1.rb:7:in `each'
        from extract1.rb:7:in `<main>'

Thanks

grail · 09-03-2013, 09:38 PM

This error points to a previous change you would have had to make, as it is saying that $1 is nil and hence the nil class has no method called scan.
Which means the following regex has been changed:

Code:

if x =~ /05(9.{32,34}.*?)940e(.{28})/

$1 would refer to - (9.{32,34}.*?)

If this line has not been changed, the other thing you may have done is run another regex between these 2 lines, such as a call to sub or gsub. If that call does not use brackets to save a back reference,
or the item being searched for does not exist, then again $1 will be nil.
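
A minimal sketch of how $1 gets replaced by a later match, using =~ and a shortened, made-up record:

```ruby
x = "05910f01940e0102"       # shortened, made-up record

x =~ /05(9.*?)940e(.*)/
blocks = $1                  # "910f01", copied while this match is still current
rest   = $2                  # "0102"

"abc" =~ /b/                 # matches, but has no capture group
after_groupless = $1         # nil

"abc" =~ /(z)/               # has a group, but does not match
after_failed = $1            # nil

p blocks, rest, after_groupless, after_failed
```

Copying $1 and $2 into local variables straight after the match (as the script does with rest = $2) keeps them safe from any regex that runs afterwards.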

Perseus · 09-04-2013, 02:02 AM

Hello grail,

I understand your look-ahead regex now, thank you for explaining it. Your regex and the long regex I've tried match the same
strings within $1, but with my regex it fails. I'm not sure if it is because it contains "(" and ")".

Code:

$1.scan(/9\d([1-9][\da-f]|0[e-f]|0[1-9a]).{26,28}(0[0-1]){1,2}/).each{

You can check outside the script that this regex matches the same characters as the shorter look-ahead version, but inside
the script it fails.

Regards

grail · 09-04-2013, 05:09 AM

Well, I replaced my line with yours and I get an error related to the fact that the printf is not receiving what it is looking for, but not the error you are getting.

I will say again that your error points to the fact that $1 is nil. May I suggest placing the following on the line immediately preceding this entry:

Code:

puts $1

This will show you what the regex is scanning before it does so.

Perseus · 09-04-2013, 05:42 PM

Hello grail,

I've tried adding puts $1 to check the content of $1.

If I use this regex

Code:

$1.scan(/9.*?(?=9[1-9]|$)/).each{

I correctly get the content of $1 for each iteration.

Code:

$ ruby extract1.rb file
910f01020000000d8147451807ffffff009310010c0000000d8147451805ffffff0101960f010c0000000d81474518559fffff00
910f01020000000d8147451825ffffff009310010c0000000d8147451805ffffff0101960f010c0000000d81474518559fffff00
910f01020000000d8147451905ffffff009310010c0000000d8147451805ffffff0101960f010c0000000d81474518559fffff00
910f01020000000d8147451844ffffff009310010c0000000d8147451805ffffff0101960f010c0000000d81474518559fffff00

If I use this regex

Code:

$1.scan(/9\d([1-9][\da-f]|0[e-f]|0[1-9a]).{26,28}(0[0-1]){1,2}/).each{

I get the content of $1 only for the first iteration, followed by this error:

Code:

$ ruby extract1.rb file
910f01020000000d8147451807ffffff009310010c0000000d8147451805ffffff0101960f010c0000000d81474518559fffff00
extract1.rb:24:in `block (2 levels) in <main>': undefined method `scan' for nil:NilClass (NoMethodError)
        from extract1.rb:17:in `each'
        from extract1.rb:17:in `block in <main>'
        from extract1.rb:7:in `each'
        from extract1.rb:7:in `<main>'

I don't understand why, if both are valid regexes that match the same strings.
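
One possible explanation, which the thread itself does not confirm: String#scan yields whole-match strings when the pattern has no capture groups, but arrays of the captured groups when it has them. Assuming the loop body still calls y[4..-1].scan(...), y would then be a two-element array, y[4..-1] would be nil, and calling scan on nil gives this kind of NoMethodError. A sketch on two blocks of the sample data:

```ruby
data = "910f01020000000d8147451807ffffff00" \
       "9310010c0000000d8147451805ffffff0101"

# No capture groups: each element is the whole matched string.
plain = data.scan(/9.*?(?=9[1-9]|$)/)

# Capture groups: each element is an array of the groups only.
grouped = data.scan(/9\d([1-9][\da-f]|0[e-f]|0[1-9a]).{26,28}(0[0-1]){1,2}/)

# Same pattern with non-capturing groups (?:...): whole strings again.
uncaptured = data.scan(/9\d(?:[1-9][\da-f]|0[e-f]|0[1-9a]).{26,28}(?:0[0-1]){1,2}/)

p plain.first          # "910f01020000000d8147451807ffffff00"
p grouped.first        # ["0f", "00"]
p grouped.first[4..-1] # nil, so .scan on it would raise NoMethodError
p uncaptured == plain  # true for this data
```

If that is what is happening, switching the groups to (?:...) keeps the length check while leaving y a plain string.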

I am trying to use the long regex because, if 9X occurs in the middle of a string that I really want to match, the short regex will
match a smaller substring. Because of that I'm trying to force the length of the substring by putting "{26,28}".

Thanks again.

grail · 09-05-2013, 07:53 AM

Did you try my previous suggestion?

I am not getting the same output or error as you so I would suggest you are either using different data or you have changed another part of the code as well as the line you have mentioned.

Please provide your current code and test data.
Also, what version of Ruby are you running?