AWK Comparing two files

elonden · 12-09-2011, 12:33 AM

Hello,

I've been banging my head for a day now but it seems I'm doing something really wrong.

I have two files:

1. file called "pl" containing entries like:
64a240
64a340
64a440
64a540
64a640
64a740
64a840
64a8c0
64a940

2. file called "ns" containing entries like:
645b00,20:11:00:21:5a:2f:22:64
645b02,50:06:0b:00:00:c2:a2:2a
645b04,50:06:0b:00:00:c2:a2:3a
645b06,50:06:0b:00:00:c2:a2:32
645b07,50:06:0b:00:00:c2:a2:16
645c00,21:00:00:e0:8b:1e:9b:25
645d00,50:06:0e:80:05:b0:8c:56

What I want is to check whether each entry in file #1 exists in field #1 of file #2 and, if NOT, either print it or store it in an array.

Any feedback would be welcome.

What I have is the following (which obviously doesn't work :-)):

Code:

        awk 'BEGIN { FS=",";
                while ((getline < "ns") > 0)
                fcid=$1
                wwn=$2
                nsvar[fcid]=wwn
                close("ns")
       }        
              
        {

        while ((getline plvar < "pl") > 0) {

                if (plvar in nsvar) {
                                print("OK");
                                close(plvar)}
                else        {
                                print("Not OK");
                                close(plvar)}                   

                }

        }'

David the H. · 12-09-2011, 01:09 AM

If the given pattern only exists once on the line, how about this instead?

Code:

grep -v -f ./pl ./ns
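
Note that the command above lists the ns lines that match none of the pl entries. For the reverse check described in the first post (pl entries that do not appear in field #1 of ns), a minimal sketch along the same lines might look like the following. The function name missing_keys is made up for illustration, and reading the pattern list from standard input with "-f -" is an assumption about your grep (GNU and BSD grep accept it).

```shell
# missing_keys KEYFILE DATAFILE   (hypothetical helper name)
# Print each line of KEYFILE that is not exactly equal to the
# first comma-separated field of any line in DATAFILE.
missing_keys() {
    # cut extracts field 1 of DATAFILE; grep reads those values as
    # fixed strings (-F) that must match a whole line (-x), and
    # prints the KEYFILE lines that match none of them (-v).
    cut -d, -f1 "$2" | grep -v -x -F -f - "$1"
}

# Example: missing_keys pl ns
```

Using -F and -x avoids treating the entries as regular expressions and avoids partial matches, so the "only once on the line" condition does not matter here.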

Nominal Animal · 12-09-2011, 09:23 AM

elonden's awk script is pretty close, actually. Just switch the two files, and it becomes pretty simple.

Code:

awk -v keyfile=pl '
    BEGIN {
        FS = "," ;

        while ((getline < keyfile) > 0)
            if ($1 != "")
                keys[$1] = 1 ;

        close(keyfile)
    }

    ($1 in keys) { print $0 }

    ' ns

I added the semicolons so you can put it all on one line if you want.

The BEGIN rule reads in the keys from keyfile (pl), and creates an associative array out of them. The important bit is that you populate the keys in the keys array. (The value is irrelevant here, but often useful. You might use e.g. keys[$1]=++nkeys instead if you later extend the script and need to keep track of which key caused the record to be output.)
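
As a rough illustration of the keys[$1]=++nkeys idea, the sketch below stores the position of each key among the non-empty lines of the key file, and prefixes every matching record with it. The function name tag_matches and the "N:record" output layout are made up for this example.

```shell
# tag_matches KEYFILE DATAFILE   (hypothetical helper name)
# Print "N:record" for every DATAFILE record whose first field is a key,
# where N is the position of that key among the non-empty KEYFILE lines.
tag_matches() {
    awk -v keyfile="$1" '
        BEGIN {
            FS = ","
            # Store the ordinal of each key instead of a constant 1.
            while ((getline < keyfile) > 0)
                if ($1 != "")
                    keys[$1] = ++nkeys
            close(keyfile)
        }

        # keys[$1] tells which key caused this record to be printed.
        ($1 in keys) { print keys[$1] ":" $0 }
    ' "$2"
}

# Example: tag_matches pl ns
```

If a key is listed more than once in the key file, the last occurrence wins, because the later assignment overwrites the earlier one.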

The rule ($1 in keys) is considered for each input record (line) of the ns file. It applies to all records where the first field matches one of the keys in the keys array. The body of the rule just prints the entire record.
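
The script above prints the ns records whose first field is listed in pl. For the check described in the first post (pl entries that are NOT present in field #1 of ns), the roles of the two files can be swapped back and the test negated. A minimal sketch, with the function name not_in_ns and the variable names datafile and seen made up for illustration:

```shell
# not_in_ns KEYFILE DATAFILE   (hypothetical helper name)
# Print every KEYFILE entry that is not the first field of any DATAFILE record.
not_in_ns() {
    awk -v datafile="$2" '
        BEGIN {
            FS = ","
            # Remember every first field seen in the data file.
            while ((getline < datafile) > 0)
                if ($1 != "")
                    seen[$1] = 1
            close(datafile)
        }

        # Main input is the key file: keep entries that were never seen.
        ($1 != "" && !($1 in seen)) { print $1 }
    ' "$1"
}

# Example: not_in_ns pl ns
```

To collect the entries in an array instead of printing them, the print could be replaced with something like missing[++n] = $1, with the array processed in an END rule.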

If your input files may contain any newline convention (i.e. they may be produced in various operating systems, including Windows and old Macs), and you wish to retain that convention, extend the script a bit, and use GNU awk (gawk):

Code:

gawk -v keyfile=pl '
    BEGIN {
        RS = "[\n\r]+" ;
        FS = "," ;
        RT = "\n" ;

        while ((getline < keyfile) > 0)
            if ($1 != "")
                keys[$1] = 1 ;

        close(keyfile)
    }

    ($1 in keys) { printf("%s%s", $0, RT) }

    ' ns

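The same script also runs in other awk variants that accept a regular expression as RS (mawk, for example; POSIX leaves the behaviour of a multi-character RS unspecified), but they do not retain the newline convention: their output will always use UNIX newlines ("\n"). Only gawk retains it. (GNU awk provides an automatic variable RT, which contains the input text that matched the record separator RS for the current record. Other awk variants treat RT as a normal variable, so it keeps the "\n" assigned in the BEGIN rule.)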
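
If gawk is not available and the files only ever use UNIX (LF) or Windows (CRLF) line endings, one possible workaround is to leave RS at its default and strip a trailing carriage return from the key before the lookup. This is a different technique from the RT approach, it does not handle old Mac (CR-only) files, and the function name match_crlf is made up for illustration.

```shell
# match_crlf KEYFILE DATAFILE   (hypothetical helper name)
# Print DATAFILE records whose first field is listed in KEYFILE,
# tolerating LF or CRLF line endings in either file.
match_crlf() {
    awk -v keyfile="$1" '
        BEGIN {
            FS = ","
            while ((getline < keyfile) > 0) {
                k = $1
                sub(/\r$/, "", k)      # drop the CR of a CRLF line ending
                if (k != "")
                    keys[k] = 1
            }
            close(keyfile)
        }

        {
            k = $1
            sub(/\r$/, "", k)          # only matters for records without a comma
            if (k in keys)
                print $0               # a CR at the end of $0 is printed as-is
        }
    ' "$2"
}

# Example: match_crlf pl ns
```

Because the carriage return stays part of $0, records from a CRLF file are written back with CRLF endings, and records from an LF file with plain LF.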