AWK/SED Multiple pattern matching over multiple lines issue

GigerMalmensteen · 11-24-2006, 04:42 AM

I have to construct a maintenance program; part of this program is the interrogation of log files.

Ordinarily a grep or sed would sort me right out; however, this problem has a few other restrictions.

I have to initially get the current date from the system and then match this to entries in a log file. Not a problem, already done. However, once I have located a matching line I then have to step over the next lines looking for another pattern and, if found, write these entries to a file. I can ONLY use either grep, sed or awk to do this. I believe awk will do it no problem; however, I am not familiar with all its aspects. An example of the data may help:

test.log:

2006 Nov 06 18:01:25:538 GMT +1 userQueue - Job-18494
s/QueryLog]: located user queue on line 654 of system 5432
2006 Nov 06 18:03:25:538 GMT +1 userQueue - Job-18494
s/QueryLog]: located user queue on line 654 of system 5432
2006 Nov 06 18:04:25:538 GMT +1 userQueue - Job-18494
s/QueryLog]: located user queue on line 654 of system 5432
2006 Nov 06 18:06:25:538 GMT +1 userQueue - Job-18494
s/QueryLog]: located user queue on line 654 of system 5432
2006 Nov 06 18:07:25:538 GMT +1 userQueue - Job-18494
s/QueryLog]: located user queue on line 654 of system 5432
2006 Nov 06 18:08:26:179 GMT +1 userQueue - [Unknown
] - Severity: 2; Category: ; ExceptionCode: ; Message: unable to create new nati
ve thread; Parameters: <n/a>; Stack Trace: Job-18507 Error in userQueue
java.lang.OutOfMemoryError: unable to create new native thread

I need to extract the corresponding line(s) relating to the OutOfMemoryError and date! e.g. output should look like:

(date) (filename) (error)

2006 Nov 06 userQueue java.lang.OutOfMemoryError: unable to create new native thread

Currently I'm using something like this:

#!/bin/bash

date=`date | awk '{print $6 " " $2 " " $3}'`

filename=`sed -n "/$date/p" *.log* | awk '{print $7}'`

echo "Date is: " $date
echo "Filename is: " $filename

search=`sed "/$date/p" *.log* | grep OutOfMemory`

echo "Search Results: " $search

totalString=$date" "$filename" "$search
echo "Final Result: "$totalString > errorFiles

This of course doesn't work and gets every instance of either 2006 Nov 06 OR OutOfMemory.

I have also played around with simple oneliners like:

sed -e '/2006 Nov 06/b' -e '/OutOfMemoryError/b' -e d test.log > output
awk '{ if($1 == "2006" && $2 == "Nov" && $3 == "21") print}' test.log

I believe awk is the way to go. From the above example I should only have to search for the next pattern and output. But I'm unsure.

I hope some Linux crack could help with this. I'm sure someone with a more in-depth knowledge of awk or sed could solve this very simply.

Any help would be great. Thanks.

Hko · 11-24-2006, 06:33 AM

It's not entirely clear to me which lines exactly you're trying to filter out of the file. So here are a few commands that I think/hope may help...

Code:

# Get all lines that start with $date and also
# contain "OutOfMemoryError". Just grep will do
# in that case.
#
grep "^$date.*OutOfMemory" log.txt

# Get all lines between (and including) the
# first that starts with $date until the first line
# after that which contains "OutOfMemoryError" 
# (possibly starts with a different date)
#
sed -n "/^$date/,/OutOfMemoryError/p" log.txt

# Get all lines between (and including) the
# first that starts with $date until the first line
# after that which starts with $date and
# also contains "OutOfMemoryError".
#
sed -n "/^$date/,/^$date.*OutOfMemoryError/p" log.txt

Just a minor tip: getting today's date in that format is easier, and runs faster, this way:

Code:

date=`date +"%Y %b %d"`
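
A small sketch of why the format string matters here, beyond speed: in the C locale plain `date` pads a single-digit day with a space, so `date | awk '{print $6 " " $2 " " $3}'` would give `2006 Nov 6`, which cannot match the zero-padded `06` in the log, whereas `%d` is always two digits. The helper name `today_stamp` is made up for illustration, and forcing `LC_ALL=C` is an assumption on my part, so that `%b` yields the English month abbreviation the log uses.

```shell
#!/bin/sh
# today_stamp is a hypothetical helper name, not something from this thread.
# LC_ALL=C is an assumption: it keeps %b as an English abbreviation ("Nov").
today_stamp() {
    LC_ALL=C date +"%Y %b %d"
}

# Same variable name as in the scripts above.
date=`today_stamp`
echo "Date is: $date"
```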

Hope this helps.

GigerMalmensteen · 11-24-2006, 07:05 AM

Hko,

Thanks for your time, response and advice regarding getting the date.

I am trying to filter the entire file. All I need is the line that is identified as having a date that matches the current date, and that is followed by the 'OutOfMemory' error string, which in most cases will be 4 lines below the matched date line.

My primary problem is that when I make a search I get a list of all instances that have the date value.

The date field is not a unique identifier. The relationship between the date and error is the unique part!

Thanks

Hko · 11-24-2006, 10:49 AM

OK. If I understand correctly what you're trying to do, this would do the trick:

Code:

#!/bin/bash

date=`date +"%Y %b %d"`
string="OutOfMemoryError"
file="log.txt"

sed -n -e'/^'"$date"'/{' -eh -en -e\} \
    -e'/'"$string"'/{' -eH -eg -e\} \
    -e'/^'"$date"'.*\n.*'"$string"'/p' -eH \
    "$file"

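For comparison, the same pairing of a date line with a later error line can be sketched in awk. This is only a sketch under some assumptions: every log entry starts with a four-digit year, the queue name is the 7th field of that first line (as in the original `awk '{print $7}'`), and `find_errors` is a made-up helper name. It uses `-v`, so on an older system a POSIX awk such as nawk may be needed. Only the most recent entry header is remembered, so the size of the log should not matter.

```shell
#!/bin/sh
# find_errors is a hypothetical helper name.
# usage: find_errors DATE STRING LOGFILE
find_errors() {
    awk -v d="$1" -v s="$2" '
        # An entry header starts with a four-digit year. Remember its
        # queue name (field 7) only if the header is for the wanted date.
        /^[0-9][0-9][0-9][0-9] / {
            queue = (index($0, d) == 1) ? $7 : ""
            next
        }
        # A continuation line: report it if it contains the error string
        # and the entry it belongs to was for the wanted date.
        queue != "" && index($0, s) > 0 {
            print d, queue, $0
        }
    ' "$3"
}

# Possible use, with $date set as in the script above:
#   find_errors "$date" OutOfMemoryError test.log > errorFiles
```

Run against the sample test.log with the date "2006 Nov 06", this should print the date, the queue name and the java.lang.OutOfMemoryError line on a single output line.
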
Tinkster · 11-25-2006, 01:14 PM

Have you wrapped the log file's lines like that on purpose? Could you
make everything that belongs to one log entry reside on one line?

Cheers,
Tink

firstfire · 11-26-2006, 11:56 PM

Hi!
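One more tip: all your entries begin with 2006, so you can delete the newlines altogether and use the year as a field separator. Each field is then one complete log entry (minus its leading year).

Code:

...|tr -d '\n'|awk -F '200[0-9]' '{for (i = 1; i <= NF; i++) if ($i ~ /OutOfMemoryError/) print $i}'|...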

You have to test this code, because I can not do this at the moment.

igorc · 11-27-2006, 12:43 AM

Take a look at the getline command, which is part of awk/gawk.
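
A minimal sketch of what that could look like, not a tested solution for the real logs: once a line for the wanted date is found, `getline` reads ahead through the following lines. The assumptions are that every entry starts with a four-digit year and that the queue name is the 7th field of that line; `scan_with_getline` is a made-up name, and `-v` needs a POSIX awk (nawk or gawk).

```shell
#!/bin/sh
# scan_with_getline is a hypothetical helper name.
# usage: scan_with_getline DATE STRING LOGFILE
scan_with_getline() {
    awk -v d="$1" -v s="$2" '
        index($0, d) == 1 {
            queue = $7
            # Read ahead; plain getline replaces $0 with the next line.
            while ((getline) > 0) {
                if ($0 ~ /^[0-9][0-9][0-9][0-9] /) {
                    # A new entry begins: stop unless it is for the same date.
                    if (index($0, d) != 1)
                        break
                    queue = $7
                } else if (index($0, s) > 0) {
                    print d, queue, $0
                }
            }
        }
    ' "$3"
}
```

Note the parentheses in `(getline) > 0`: getline returns 0 at end of file and -1 on error, so a bare `while (getline)` could loop forever on a read error.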

jschiwal · 11-27-2006, 12:57 AM

Since you are looking for a pattern on a single line containing both the date and "Out of Memory", these could both be contained in a regular expression pattern. Just put a ".*" pattern in between the two patterns.

Or you could use grep twice: "grep 'pattern1' logfile | grep 'pattern2'" to produce an intersection of the two patterns.

There are three other things you can use with sed. The -n option will suppress output unless you use the print command. The -e option allows you to enter more than a single command (as demonstrated by poster Hko above). You can use braces to group commands under a /pattern/ address to further fine-tune the search. This may allow you to first select lines with the current date, and then create different files which filter different patterns.
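
To illustrate those three points together, here is a rough sketch: -n suppresses the default output, each -e adds one command, and the braces restrict the inner `w` (write to file) commands to lines that start with the date. The helper name `split_by_pattern`, the two sub-patterns and the output file names are all made up for the example. Keep in mind that sed looks at one line at a time here, so this only sorts the date lines themselves; it does not pair them with the lines that follow.

```shell
#!/bin/sh
# split_by_pattern is a hypothetical helper; the output files are
# whatever the caller passes in.
# usage: split_by_pattern DATE LOGFILE JOBS_OUT UNKNOWN_OUT
split_by_pattern() {
    sed -n -e "/^$1/{" \
           -e "/Job-/w $3" \
           -e "/Unknown/w $4" \
           -e '}' "$2"
}

# Possible use:
#   split_by_pattern "2006 Nov 06" test.log jobs.out unknown.out
```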

If you have a gawk-doc package, you might want to install it. It includes the book "Gawk: Effective AWK Programming."

GigerMalmensteen · 11-27-2006, 02:56 AM

Thanks for all the feedback guys.

HKO your solution worked great on a single entry log file I tested, however sed died with a "sed: Memory allocation failed." error when tested on a real 8MB file. Any suggestions?

GigerMalmensteen · 11-28-2006, 08:54 AM

Just in case anyone was interested, an ugly solution I came up with is this:

#!/bin/bash
date=`date +"%Y %b %d"`
errorCode=$1
sed -n '/'"$date"'/,$p' ./data/5.log > tempfile
grep -n "$errorCode" tempfile | cut -d: -f 1 > lineValues
count=`wc -w < lineValues`
grep -n "$date" tempfile | cut -d: -f 1 > dateValues

for((j=1;j<="$count";j++)); do
nOe=`sed "$j"'q;d' lineValues`
nOd=`sed "$j"'q;d' dateValues`
max=$nOe
min=$nOd
for ((i="$nOe";i>=0;i--)); do
if [ "$i" == "$max" ];then
error=`sed "$max"'q;d' tempfile`
fi
if [ "$i" == "$min" ];then
info=`sed "$min"'q;d' tempfile`
fi
done
output="$info"" ""$error"
echo "$output"
done

Thanks for the help guys.

Hko · 11-28-2006, 03:23 PM

Quote:

Originally Posted by GigerMalmensteen

HKO your solution worked great on a single entry log file I tested, however sed died with a "sed: Memory allocation failed." error when tested on a real 8MB file. Any suggestions?

Here's a different, simpler approach. It will still read entire files into memory, but it's not sed that has to do that.

Code:

#!/bin/bash

date=`date +"%Y %b %d"`
string="OutOfMemoryError"
file="log.txt"

tac "$file" | sed -n '/'"$string"'/,/^'"$date"'/p' | tac

If the script above doesn't have the memory problem (I expect it doesn't, but I haven't tried it on large files), it's a much cleaner solution than your "ugly" one IMHO.
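
Should tac be missing on a given system, a rough stand-in can be sketched with awk. This is an assumption-laden substitute, not a drop-in replacement: `reverse_lines` is a made-up name, and it keeps the whole input in an awk array, so the memory use moves from sed to awk.

```shell
#!/bin/sh
# reverse_lines is a hypothetical stand-in for tac: it stores every
# line in an array and prints them last to first.
reverse_lines() {
    awk '{ line[NR] = $0 } END { for (i = NR; i > 0; i--) print line[i] }' "$@"
}

# Possible use, with $file, $string and $date set as in the script above:
#   reverse_lines "$file" | sed -n '/'"$string"'/,/^'"$date"'/p' | reverse_lines
```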

GigerMalmensteen · 11-29-2006, 03:57 AM

Hko,

Once again thanks for your response. Just to let you know 'tac' doesn't come as standard with the SunOS version I am using. So the elegant solution you proposed can't be used :?

I am working with limited resources.

osvaldomarques · 12-01-2006, 04:32 PM

Hi GigerMalmensteen,

As you have several steps to accomplish your task, I guess the best tool for your needs is awk: first, identify the messages of the day; second, concatenate all the physical lines that make up the logical one; then decide whether it is to be reported; and finally cut out the slices you want to display.

Below I show you a script which does the above steps:

Code:

#!/bin/sh

DATE=`date +"%Y %b %d"`
DATE="2006 Nov 06" # to test your test.log

cat *.log | \
awk 'BEGIN { date = "'"$DATE"'" }

function check_output()
{
  #
  # check for error report on the assembled line
  #
  if ((ind = match(line, /Error in /)) != 0)
  {
    # ind points to the string "Error in "
    ind += 9 # go to post string
    # get the portion of the line which
    # contains the file and error message
    tmp = substr(line, ind)
    # the file name ends at the first space
    ind = index(tmp, " ")
    file = substr(tmp, 1, ind - 1)
    error = substr(tmp, ind + 1)
    # printing the 3 fields separated by [TAB]
    printf("%s\t%s\t%s\n", date, file, error)
  }
  line = ""
}

{ # main loop
  if (index($0, date) == 1)
  {
    # if the line starts with the date
    # check to see if there is one
    # already assembled
    if (length(line) != 0)
      check_output()
    # Initialize a new line
    line = $0
  }
  else
  {
    # if the line does not start with
    # the date, check to see if there
    # is already a line in process. If
    # positive, cat the input to the
    # line. Otherwise, discard it.
    if (length(line) != 0)
      line = line " " $0
  }
}

END {
  # End of file, we could have an
  # assembled line; go and check it
  if (length(line) != 0)
    check_output()
}'

matthewg42 · 12-01-2006, 06:16 PM

It would all be very easy and elegant (not to mention pretty fast) to use perl:

Code:

#!/usr/bin/perl -w

use strict;

my $last_date = "unknown";
while(<>) {
    if ( /^(\d\d\d\d \w\w\w \d\d \d\d:\d\d:\d\d:\d\d\d \w\w\w ([+\-]\d)?)/ ) {
        $last_date = $1;
    }

    if ( /OutOfMemoryError/ ) {
        print "Out of memory detected at line $. - date = $last_date\n";
        next;
    }
}

You would run this on the logfiles by saving it to a file, e.g. "mylogscan", changing the mode of mylogscan to be executable:

Code:

chmod 755 mylogscan

And then executing with the filename of the log (or multiple logfiles if you like) as arguments to the program:

Code:

./mylogscan logfile1 logfile2 logfile3

A little Perl de-mystification might help to know how it's working:

use strict; just means complain a lot about potentially risky code. It's generally a good idea to use this.

Code:

while(<>) { ... }

The mysterious object here for Perl virgins is the <>. <SOMETHING> is Perl's way to read one line from the file handle SOMETHING. If you don't specify a SOMETHING, Perl opens the files named as arguments to the script in turn (the names in the array @ARGV), reads lines from them, closes them, opens the next file etc. If you don't specify any files as arguments to the script, Perl will read from standard input. Lines read in this manner get put in the variable $_. <> returns true until the end of possible input, at which point your while loop will terminate.

Code:

/^(\d\d\d\d \w\w\w \d\d \d\d:\d\d:\d\d:\d\d\d \w\w\w ([+\-]\d)?)/

This line is the most likely, in my opinion, to have Perl virgins running for the hills screaming. The bit between the slashes is a Perl-style regular expression. \d means "a digit", \w means a "word" character (letters, digits and _). So this stuff between the slashes means "four digits, a space, three word characters, a space, two digits, a space, two digits, a colon" etc. The [+\-] is a way of saying a + or a - character; the ? means "the previous bit is optional". Brackets group expressions together and, if there is a match, the matched values are assigned to $1 for the first set of brackets, $2 for the second set etc. By default, regular expressions are matched against the $_ variable, which is set to the line read from <> as described above. The /expression/ returns true if a match is found. Phew! In short, all this means "look for something which looks like a date, and if you find it, put the matched value in $1, which we then save in the variable $last_date".

The rest is pretty self explanatory I think.

Perl's syntax is highly abbreviated for this sort of task because it's exactly the sort of thing that needs to be done a lot. It saves a lot of typing at the expense of scaring off newbies.

Perl eats gigabytes of log files for breakfast, and still has room left for more! Long live Perl!

chrism01 · 12-03-2006, 04:59 PM

matthewg42, you ought to be able to shorten the regex with these operators, I believe:

{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times

see http://perldoc.perl.org/perlre.html
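
As an illustration, in the Perl pattern `\d\d\d\d \w\w\w \d\d` could be written `\d{4} \w{3} \d{2}`. The same {n} operator also exists in POSIX extended regular expressions, so here is a small shell sketch with grep -E instead of Perl; `count_headers` is a made-up helper name, and the pattern only covers the date and time part of the line.

```shell
#!/bin/sh
# count_headers is a hypothetical helper: it counts the lines that begin
# with a timestamp such as "2006 Nov 06 18:08:26:179 ".
count_headers() {
    grep -c -E '^[0-9]{4} [A-Za-z]{3} [0-9]{2} ([0-9]{2}:){3}[0-9]{3} ' "$@"
}

# Possible use:
#   count_headers test.log
```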