help extracting data from csv file

willinusf · 10-26-2006, 09:23 PM

I have several csv files that contain anywhere from 70,000 to 150,000 rows of data (only one column) in integer form. I need to extract every two integers (pairing) that differs by between 57.9 and 58.1 from one another and place them in another file. All integers that do not have another integer that differs by said amount is unimportant. I would like to have a script to find these pairings within a csv file and another script to find these pairings between two csv files. Your help is greatly appreciated. Thank you for your time.

Will

frob23 · 10-26-2006, 09:55 PM

An example data set here would help a lot.

Are we talking

Code:

A
A
B
B
...etc

For the one file set and

Code:

file1     file2
  A         A
  B         B
  C         C
     etc

For the two files or what?

And you don't care if the difference is greater or less than those values but only in that range... right?

willinusf · 10-26-2006, 10:06 PM

Correct. Has to be in that range. And the files are just like that with integers to 4 decimal places. Thanks.

frob23 · 10-26-2006, 10:10 PM

Note that the two file example is actually fairly simple.

Code:

paste $file1 $file2 | awk '{if ($1 > $2) {diff=$1-$2} else {diff=$2-$1}; if (diff>57.9 && diff<58.1) {print $1 "," $2}}' > $file3

Where:
$file1 = the first file with data in it.
$file2 = the second file with data in it.
$file3 = the new file listing the pairs in csv form two columns.

If they're in the one file form... there is a way to do it fairly simply (the awk part is the same) but it is slipping my mind at the moment.

Edit:

Ah, it's a classic sed one-liner.

Code:

sed '$!N;s/\n/ /' $file1 | awk '{if ($1 > $2) {diff=$1-$2} else {diff=$2-$1}; if (diff>57.9 && diff<58.1) {print $1 "," $2}}' > $file3

This assumes /bin/sh (or bash). If you're using [t]csh, you'll need to modify that. But it's doubtful that you will be. The $fileX things are the same except there's no $file2 in this case.

Edit2: Or you can use paste again

Code:

paste -s -d '\t\n' $file1 | awk '{if ($1 > $2) {diff=$1-$2} else {diff=$2-$1}; if (diff>57.9 && diff<58.1) {print $1 "," $2}}' > $file3

I knew that paste supported it but I needed to crack the man-page to remember how.

willinusf · 10-27-2006, 12:37 PM

I need to clarify the file formats:

There are 70,000 cells in one column. I need to check each cell against every other cell for the above mentioned difference. They are not already paired up. So, cell A1 may be 58 apart from cell A125 or something like that. Thanks for the help and prompt response. I apologize for my lack of clarity.

Will

frob23 · 10-27-2006, 07:58 PM

You'll have to provide me with an example because I have no idea what you're talking about. For one thing, how do you know that A1 and A125 are related? Are we talking more than one column or not? What is the relation between these "fields" and how do you determine what is what? This is why I needed to clarify what the files looked like, although I may have not been clear when I gave my prototype examples.

This may be slightly more complicated than the above but probably not to the point where you need to go beyond the existant tools.

frob23 · 10-27-2006, 08:00 PM

Now, when you say that all other values are unimportant, do you mean we toss all the values between them or do we need to check those values for pairings as we recurse through it all? The problem I'm having is that I'm unclear on the specs.

willinusf · 10-27-2006, 08:25 PM

So if there is an integer in these 70,000 integers that doesn't differ from another integer by 57.9 to 58.1, then I don't need it. So, every integer that does partner with any other integer should be paired with that other integer in A, B format if possible. I know I'm confusing, but english really is my first language

willinusf · 10-27-2006, 08:26 PM

And yes, check all integers for partners.

frob23 · 10-27-2006, 08:59 PM

Does the order of the values matter? Can a value be used more than once? In the two file example, does appending the file change the results you'd expect from interleaving it?

For example would:

Code:

file1   file2
file1   file2
file1   file2
file1   file2
file1   file2

give different results from:

Code:

file1
file1
file1
file1
file2
file2
file2
file2

Assuming each fileX was replaced by the corresponding values from the file?

frob23 · 10-27-2006, 09:10 PM

Note: The following code assumes that once you've used a value you don't want it to be used again. Remove the line with the =99999.999999 in it, if you want values to be reused. Save the following as compare.awk

Code:

# This is a function to compare each new value with all the others
# and determine if it's within the range.  If it is, print out the
# value pair and return zero, else return 1 and add it to the array.
# This should change the array on match to avoid duplicate matches.
function compare(value) {
        for(count=0;count<len;count++) {
                if(vals[count] > value) {
                        diff=vals[count]-value;
                }
                else {
                        diff=value-vals[count];
                }
                if(diff>57.9 && diff<58.1) {
                        print vals[count] "," value
                        # NOTE:  This value must be unable to ever
                        # be within valid range.
                        vals[count]=99999.999999;
                        return 0;
                }
        }
        return 1;
}

BEGIN{
        len=0;
}

{
        n=1;
        while(1) {
                if($n) {
                        if(compare($n)) {
                                vals[len]=$n;
                                len++;
                        }
                        n++;
                }
                else {
                        break;
                }
        }
}

Note: This isn't the cleanest code or the most memory efficient but it should work. You're welcome to find a "better way" if you want.

You call it for one file with:

Code:

awk -f compare.awk $file1 > $file3

For two files, it depends on if interleaving matters. If it does,

Code:

paste $file1 $file2 | awk -f compare.awk > $file3

If it doesn't

Code:

awk -f compare.awk $file1 $file2 > $file3

$file[1-3] are the same as listed above. Again, $file3 is matched pairs in a two column csv file.