I have several CSV files, each containing anywhere from 70,000 to 150,000 rows of data (only one column) in integer form. I need to extract every pair of integers that differ from one another by between 57.9 and 58.1 and place them in another file. Any integer that has no partner differing by that amount is unimportant. I would like one script to find these pairings within a single CSV file and another script to find them between two CSV files. Your help is greatly appreciated. Thank you for your time.
Where:
$file1 = the first file with data in it.
$file2 = the second file with data in it.
$file3 = the new file listing the pairs in csv form two columns.
If they're in the one file form... there is a way to do it fairly simply (the awk part is the same) but it is slipping my mind at the moment.
This assumes /bin/sh (or bash). If you're using [t]csh, you'll need to modify that. But it's doubtful that you will be. The $fileX things are the same except there's no $file2 in this case.
There are 70,000 cells in one column. I need to check each cell against every other cell for the above mentioned difference. They are not already paired up. So, cell A1 may be 58 apart from cell A125 or something like that. Thanks for the help and prompt response. I apologize for my lack of clarity.
You'll have to provide me with an example because I have no idea what you're talking about. For one thing, how do you know that A1 and A125 are related? Are we talking more than one column or not? What is the relation between these "fields" and how do you determine what is what? This is why I needed to clarify what the files looked like, although I may have not been clear when I gave my prototype examples.
This may be slightly more complicated than the above, but probably not to the point where you need to go beyond the existing tools.
Now, when you say that all other values are unimportant, do you mean we toss all the values between them or do we need to check those values for pairings as we recurse through it all? The problem I'm having is that I'm unclear on the specs.
So if there is an integer among these 70,000 integers that doesn't differ from any other integer by 57.9 to 58.1, then I don't need it. Every integer that does partner with another integer should be paired with that other integer in A, B format if possible. I know I'm confusing, but English really is my first language.
Does the order of the values matter? Can a value be used more than once? In the two file example, does appending the file change the results you'd expect from interleaving it?
Note: The following code assumes that once you've used a value you don't want it to be used again. Remove the line with the =99999.999999 in it, if you want values to be reused. Save the following as compare.awk
Code:
# Compare each new value with all the stored values and determine
# whether it's within the range. If it is, print out the value pair
# and return 0; otherwise return 1 so the caller adds it to the array.
# On a match the stored value is overwritten to avoid duplicate matches.
# count and diff are declared as extra parameters to keep them local.
function compare(value,    count, diff) {
    for(count=0;count<len;count++) {
        if(vals[count] > value) {
            diff=vals[count]-value;
        }
        else {
            diff=value-vals[count];
        }
        if(diff>57.9 && diff<58.1) {
            print vals[count] "," value
            # NOTE: This value must be unable to ever
            # be within valid range of any real value.
            vals[count]=99999.999999;
            return 0;
        }
    }
    return 1;
}
BEGIN{
    len=0;
}
{
    # Walk the fields by count (NF) rather than testing $n for truth,
    # so a field containing 0 does not end the loop early.
    for(n=1;n<=NF;n++) {
        if(compare($n)) {
            vals[len]=$n;
            len++;
        }
    }
}
Note: This isn't the cleanest code or the most memory efficient but it should work. You're welcome to find a "better way" if you want.
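As one possible "better way", here is a rough sketch that swaps the nested comparison for an array lookup. It is not the script above: it assumes every value really is an integer, so the only difference that can fall between 57.9 and 58.1 is exactly 58. The function name find_pairs_fast and the array name seen are made up for this example. Like compare.awk, it uses each value at most once, but when a value has two possible partners it may pick a different one than compare.awk would.

```shell
# find_pairs_fast [FILE...]: hypothetical helper, assumes integer input.
# Prints "a,b" for values that differ by exactly 58, using each value once.
find_pairs_fast() {
    awk '
    {
        for (n = 1; n <= NF; n++) {
            v = $n + 0
            lo = v - 58
            hi = v + 58
            # seen[x] counts earlier, still unpaired occurrences of x.
            if ((lo in seen) && seen[lo] > 0) {
                print lo "," v
                seen[lo]--
            }
            else if ((hi in seen) && seen[hi] > 0) {
                print hi "," v
                seen[hi]--
            }
            else {
                seen[v]++
            }
        }
    }' "$@"
}
```

You would call it as find_pairs_fast "$file1" > "$file3". Each value costs a couple of array lookups instead of a scan over everything read so far, which should matter with 70,000 or more rows. If the data turns out not to be whole numbers, this sketch does not apply and the range test in compare.awk is the one to use.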
You call it for one file with:
Code:
awk -f compare.awk "$file1" > "$file3"
For two files, it depends on whether interleaving matters. If it does:
Code:
paste "$file1" "$file2" | awk -f compare.awk > "$file3"
If it doesn't:
Code:
awk -f compare.awk "$file1" "$file2" > "$file3"
$file[1-3] are the same as listed above. Again, $file3 is matched pairs in a two column csv file.
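Note that both two-file commands above treat the combined input as one pool of values, so two values from the same file can be paired with each other. If only pairs with one value from each file are wanted, something along these lines might do. This is a sketch under assumptions, not part of compare.awk: the function name cross_pairs and the names first and cnt are invented here, it reports every cross-file pair (values are reused), and it relies on the first file not being empty, because the NR == FNR test cannot tell the files apart otherwise.

```shell
# cross_pairs FILE1 FILE2: hypothetical helper.
# Prints "a,b" where a comes from FILE1, b comes from FILE2, and the
# absolute difference is strictly between 57.9 and 58.1.
cross_pairs() {
    awk '
    # While reading the first file, only remember its values.
    NR == FNR {
        for (n = 1; n <= NF; n++) {
            first[cnt++] = $n
        }
        next
    }
    # For the second file, compare each value with every stored value.
    {
        for (n = 1; n <= NF; n++) {
            for (i = 0; i < cnt; i++) {
                diff = first[i] - $n
                if (diff < 0) {
                    diff = -diff
                }
                if (diff > 57.9 && diff < 58.1) {
                    print first[i] "," $n
                }
            }
        }
    }' "$1" "$2"
}
```

Call it with cross_pairs "$file1" "$file2" > "$file3". It compares every value in the second file with every value in the first, so expect it to be slow on the larger files.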