LinuxQuestions.org
Review your favorite Linux distribution.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - General
User Name
Password
Linux - General This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.

Notices


Reply
  Search this Thread
Old 10-26-2006, 09:23 PM   #1
willinusf
LQ Newbie
 
Registered: Jul 2004
Posts: 12

Rep: Reputation: 0
help extracting data from csv file


I have several csv files that contain anywhere from 70,000 to 150,000 rows of data (only one column) in integer form. I need to extract every two integers (pairing) that differs by between 57.9 and 58.1 from one another and place them in another file. All integers that do not have another integer that differs by said amount is unimportant. I would like to have a script to find these pairings within a csv file and another script to find these pairings between two csv files. Your help is greatly appreciated. Thank you for your time.

Will
 
Old 10-26-2006, 09:55 PM   #2
frob23
Senior Member
 
Registered: Jan 2004
Location: Roughly 29.467N / 81.206W
Distribution: OpenBSD, Debian, FreeBSD
Posts: 1,450

Rep: Reputation: 48
An example data set here would help a lot.

Are we talking
Code:
A
A
B
B
...etc
For the one file set and
Code:
file1     file2
  A         A
  B         B
  C         C
     etc
For the two files or what?

And you don't care if the difference is greater or less than those values but only in that range... right?
 
Old 10-26-2006, 10:06 PM   #3
willinusf
LQ Newbie
 
Registered: Jul 2004
Posts: 12

Original Poster
Rep: Reputation: 0
Correct. Has to be in that range. And the files are just like that with integers to 4 decimal places. Thanks.
 
Old 10-26-2006, 10:10 PM   #4
frob23
Senior Member
 
Registered: Jan 2004
Location: Roughly 29.467N / 81.206W
Distribution: OpenBSD, Debian, FreeBSD
Posts: 1,450

Rep: Reputation: 48
Note that the two file example is actually fairly simple.

Code:
paste $file1 $file2 | awk '{if ($1 > $2) {diff=$1-$2} else {diff=$2-$1}; if (diff>57.9 && diff<58.1) {print $1 "," $2}}' > $file3
Where:
$file1 = the first file with data in it.
$file2 = the second file with data in it.
$file3 = the new file listing the pairs in csv form two columns.

If they're in the one file form... there is a way to do it fairly simply (the awk part is the same) but it is slipping my mind at the moment.

Edit:

Ah, it's a classic sed one-liner.

Code:
sed '$!N;s/\n/ /' $file1 | awk '{if ($1 > $2) {diff=$1-$2} else {diff=$2-$1}; if (diff>57.9 && diff<58.1) {print $1 "," $2}}' > $file3
This assumes /bin/sh (or bash). If you're using [t]csh, you'll need to modify that. But it's doubtful that you will be. The $fileX things are the same except there's no $file2 in this case.

Edit2: Or you can use paste again
Code:
paste -s -d '\t\n' $file1 | awk '{if ($1 > $2) {diff=$1-$2} else {diff=$2-$1}; if (diff>57.9 && diff<58.1) {print $1 "," $2}}' > $file3
I knew that paste supported it but I needed to crack the man-page to remember how.

Last edited by frob23; 10-26-2006 at 10:22 PM.
 
Old 10-27-2006, 12:37 PM   #5
willinusf
LQ Newbie
 
Registered: Jul 2004
Posts: 12

Original Poster
Rep: Reputation: 0
I need to clarify the file formats:

There are 70,000 cells in one column. I need to check each cell against every other cell for the above mentioned difference. They are not already paired up. So, cell A1 may be 58 apart from cell A125 or something like that. Thanks for the help and prompt response. I apologize for my lack of clarity.

Will
 
Old 10-27-2006, 07:58 PM   #6
frob23
Senior Member
 
Registered: Jan 2004
Location: Roughly 29.467N / 81.206W
Distribution: OpenBSD, Debian, FreeBSD
Posts: 1,450

Rep: Reputation: 48
You'll have to provide me with an example because I have no idea what you're talking about. For one thing, how do you know that A1 and A125 are related? Are we talking more than one column or not? What is the relation between these "fields" and how do you determine what is what? This is why I needed to clarify what the files looked like, although I may have not been clear when I gave my prototype examples.

This may be slightly more complicated than the above but probably not to the point where you need to go beyond the existant tools.
 
Old 10-27-2006, 08:00 PM   #7
frob23
Senior Member
 
Registered: Jan 2004
Location: Roughly 29.467N / 81.206W
Distribution: OpenBSD, Debian, FreeBSD
Posts: 1,450

Rep: Reputation: 48
Now, when you say that all other values are unimportant, do you mean we toss all the values between them or do we need to check those values for pairings as we recurse through it all? The problem I'm having is that I'm unclear on the specs.
 
Old 10-27-2006, 08:25 PM   #8
willinusf
LQ Newbie
 
Registered: Jul 2004
Posts: 12

Original Poster
Rep: Reputation: 0
So if there is an integer in these 70,000 integers that doesn't differ from another integer by 57.9 to 58.1, then I don't need it. So, every integer that does partner with any other integer should be paired with that other integer in A, B format if possible. I know I'm confusing, but english really is my first language
 
Old 10-27-2006, 08:26 PM   #9
willinusf
LQ Newbie
 
Registered: Jul 2004
Posts: 12

Original Poster
Rep: Reputation: 0
And yes, check all integers for partners.
 
Old 10-27-2006, 08:59 PM   #10
frob23
Senior Member
 
Registered: Jan 2004
Location: Roughly 29.467N / 81.206W
Distribution: OpenBSD, Debian, FreeBSD
Posts: 1,450

Rep: Reputation: 48
Does the order of the values matter? Can a value be used more than once? In the two file example, does appending the file change the results you'd expect from interleaving it?

For example would:
Code:
file1   file2
file1   file2
file1   file2
file1   file2
file1   file2
give different results from:
Code:
file1
file1
file1
file1
file2
file2
file2
file2
Assuming each fileX was replaced by the corresponding values from the file?
 
Old 10-27-2006, 09:10 PM   #11
frob23
Senior Member
 
Registered: Jan 2004
Location: Roughly 29.467N / 81.206W
Distribution: OpenBSD, Debian, FreeBSD
Posts: 1,450

Rep: Reputation: 48
Note: The following code assumes that once you've used a value you don't want it to be used again. Remove the line with the =99999.999999 in it, if you want values to be reused. Save the following as compare.awk

Code:
# This is a function to compare each new value with all the others
# and determine if it's within the range.  If it is, print out the
# value pair and return zero, else return 1 and add it to the array.
# This should change the array on match to avoid duplicate matches.
function compare(value) {
        for(count=0;count<len;count++) {
                if(vals[count] > value) {
                        diff=vals[count]-value;
                }
                else {
                        diff=value-vals[count];
                }
                if(diff>57.9 && diff<58.1) {
                        print vals[count] "," value
                        # NOTE:  This value must be unable to ever
                        # be within valid range.
                        vals[count]=99999.999999;
                        return 0;
                }
        }
        return 1;
}

BEGIN{
        len=0;
}

{
        n=1;
        while(1) {
                if($n) {
                        if(compare($n)) {
                                vals[len]=$n;
                                len++;
                        }
                        n++;
                }
                else {
                        break;
                }
        }
}
Note: This isn't the cleanest code or the most memory efficient but it should work. You're welcome to find a "better way" if you want.

You call it for one file with:
Code:
awk -f compare.awk $file1 > $file3
For two files, it depends on if interleaving matters. If it does,
Code:
paste $file1 $file2 | awk -f compare.awk > $file3
If it doesn't
Code:
awk -f compare.awk $file1 $file2 > $file3
$file[1-3] are the same as listed above. Again, $file3 is matched pairs in a two column csv file.
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Extracting MySQL data from raw files cs-cam Linux - Software 1 06-12-2006 11:22 PM
Culling Data from a CSV file to output in excel jterr02 Programming 2 05-19-2006 04:58 AM
csv to fixed-length file roballen Programming 0 03-11-2004 03:12 AM
CSV File AMMullan Programming 2 11-10-2003 12:49 AM
Extracting data from broken drive darin3200 Linux - Software 1 07-12-2003 01:34 PM

LinuxQuestions.org > Forums > Linux Forums > Linux - General

All times are GMT -5. The time now is 08:19 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration