transpose selected groups of rows of columns into multiple columns (arrays) with other columns retained to single row-group
Hello,
I would like to know how to do the following (I have been trying awk and some program functions - datamash in Ubuntu - plus what has been posted here about converting columns to rows and vice versa):
I have a CSV file with N columns, sorted numerically so that the rows fall into groups (e.g. column 1 lists the groups, small to big), and other columns whose values change within each group (an "array"). I am trying to output one line per group, transposing the changing values within a group into columns. I have attached a screen image and CSV (text) files of the input and desired output.
Please note - I have shortened the examples: each group can consist of tens of rows, with a number of columns to be transposed into one row with multiple columns, while all the other columns, which don't change within a group, reduce to a single column in that row.
In these files I have given two columns which need to be expanded (Level, RM), and a maximum of four lines per group. The files I am actually trying to change are a few hundred lines and several columns, with up to ten or so rows per group (it varies group by group) to be transposed into columns.
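For readers who cannot see the attachments, here is a small hypothetical illustration of the shape of the transformation (the attached files are the authoritative examples; these column names and values are made up, with "*" padding short groups):

```text
Input (one row per measurement, groups keyed on column 1):
HPit,Level,Stat,RM,X,Y
1,10,A,0.5,100,200
1,20,A,0.7,100,200
2,15,B,0.4,110,210

Desired output (one row per group, varying columns spread out):
HPit,Level1,Level2,Stat,RM1,RM2,X,Y
1,10,20,A,0.5,0.7,100,200
2,15,*,B,0.4,*,110,210
```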
I am not yet proficient in Linux: I used it very basically several years ago - simple grep, awk and sed.
If anyone can show me how to get the examples working, that would be great - I can then try expanding it from there to the larger files.
Your example _input_csv.txt file has DOS-convention line endings. If you want to process such files using Linux tools, first convert them, e.g. 'fromdos < _input_csv.txt > input.csv'.
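If fromdos (or dos2unix) isn't installed, the same conversion can be done with standard tools; a minimal sketch:

```shell
# strip the carriage return at the end of each DOS-style line
sed 's/\r$//' _input_csv.txt > input.csv

# or equivalently, delete all CR characters
tr -d '\r' < _input_csv.txt > input.csv
```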
It is easy to combine the groups into a single row using awk.
If only it were that simple - the OP seems to want to merge fields (not whole records) based on which value(s) vary.
Ugh.
Looks like one would have to know the maximum number of fields for each varying column (4 for original column 2, 3 for original column 4, ...) - either supply it all as input, or parse the entire file to calculate the maxima, then (re-)process the file.
Bigger Ugh.
It is not hard to delete or move entire columns in a spreadsheet. I am suggesting awk to relieve the tedious concatenation of the group data into one row, which I think is what the OP most wants to avoid.
Many thanks for the posts - will try this weekend with the suggestions and reply back.
FYI - boughtonp, the files are generated as XYZ (CSV) files, inversion output from a Windows-based program, so the output is not already in a database-type program.
Trying to avoid spreadsheets as much as possible.
Last edited by TempleStone4510!; 08-18-2022 at 10:26 AM.
I was curious and tried to get Miller to do this (because I've not actually used it before), but got nowhere with it; just trying to follow the docs was painful.
So I had a look at GNU Datamash (which I've not previously encountered), and it took very little time to come up with this:
That's not the desired output format yet, but it does collect all the necessary data into individual rows.
The next step would be to identify the max values of the "count" columns, rename the "collapse" headers (e.g. to "RM1,RM2,RM3,RM4"), pad the corresponding rows with any necessary blanks (i.e. suffix max-count commas), then output just the desired fields and switch the output delimiter back to commas - all of which is something Awk can do.
Looking at the other stuff Datamash can do, it looks like a pretty cool tool.
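For reference, a minimal sketch of the kind of datamash invocation described above, assuming a header row and a column layout of HPit,Level,Stat,RM,X,Y (the real column names and positions come from the actual file):

```shell
# group on column 1; collapse the varying columns into comma-joined lists,
# count how many values each group produced, and keep the first value of
# the columns that are constant within a group
datamash -t, --header-in groupby 1 \
    collapse 2 count 2 first 3 collapse 4 count 4 first 5 first 6 \
    < input.csv
```

The count fields are what the follow-up awk pass would use to work out how many padded columns each collapsed list needs.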
Quote:
Many thanks for the posts - will try this weekend with the suggestions and reply back. FYI - boughtonp, the files are generated as XYZ (csv) files, inversion output from a windows-based program, so output is not already in a database type program.
It would help if you said what this 'windows-based' program is... there may be some options there.
Quote:
Trying to avoid spreadsheets as much as possible.
If the data is consistent, I'd personally create a MySQL database table, use the built-in CSV importer, then I could output the data in a LOT of different ways. Along with keeping archival data, it gives a ton of flexibility later on, if this is something you have to do often.
Quote:
If the data is consistent, I'd personally create a MySQL database table, use the built-in CSV importer, then I could output the data in a LOT of different ways. Along with keeping archival data, it gives a ton of flexibility later on, if this is something you have to do often.
That's definitely something to consider - even if they don't have/want a MySQL server running, it wouldn't take much to write a simple SQLite import script.
Quote:
That's definitely something to consider - even if they don't have/want a MySQL server running, it wouldn't take much to write a simple SQLite import script.
Agreed; I'd use MySQL, only because I'm more familiar with it, but there is a ton out there on how to import things into a DB. A script to read and output the data wouldn't be difficult, even in bash, and you could use Python, Perl, or any other language as well. It also gives you the ability to use ODBC from a spreadsheet, or any number of web-based tools to visualize the data.
Just a ton more flexibility. However... it depends on the consistency of the input data, how much there is, and how often this needs to be done. If it's a few times for a class project... bite the bullet, spend an hour or so in a spreadsheet and call it done.
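As a sketch of the SQLite route mentioned above (the table name and column names here are made up; sqlite3's .import creates the table from the CSV header row when the table doesn't already exist):

```shell
# import the CSV and emit one row per group, with the varying
# columns joined into comma-separated lists via group_concat
sqlite3 groups.db <<'EOF'
.mode csv
.import input.csv survey
SELECT HPit, group_concat(Level), Stat, group_concat(RM), X, Y
FROM survey
GROUP BY HPit;
EOF
```

Note that SQLite lets you select non-aggregated columns (Stat, X, Y) alongside GROUP BY; it picks a value from an arbitrary row of the group, which is fine here because those columns are constant within a group.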
I think this works. It uses two arrays (level and rm) to store the variable-number fields; I set the max to 4 with numlevel and numrm, but it could be passed in. I just called the data file 'data'; I removed the header by hand, but of course you could easily skip it or, if the app produces it, use it to set the number of fields.
Code:
#!/bin/bash
# numlevel and numrm are the max number of each varying field; they could be passed in as args.
awk -F, '
BEGIN { last=""; numlevel=4; numrm=4; i=1; j=1 }

# print one output row for the current group, padding short arrays with "*"
function printline() {
    printf("%s", hpit)
    # dump the level array
    for (i=1; i<=numlevel; i++) {
        if (level[i] == "")
            printf(",*")
        else
            printf(",%s", level[i])
    }
    printf(",%s", stat)
    # dump the rm array
    for (i=1; i<=numrm; i++) {
        if (rm[i] == "")
            printf(",*")
        else
            printf(",%s", rm[i])
    }
    printf(",%s,%s", xval, yval)
    printf("\n")
    # clear our 2 arrays
    delete level
    delete rm
}

# when field one changes, dump the accumulated line and reset the indices
# (but do not print on the very first record)
{
    if ($1 != last && NR != 1) { printline(); i=1; j=1 }
    hpit = $1
    level[i++] = $2
    stat = $3
    rm[j++] = $4
    xval = $5
    yval = $6
    last = $1
}

# output the last group
END { printline() }' data
Last edited by elgrandeperro; 08-19-2022 at 01:58 PM.
If you are going to script a solution, then you might as well add code to accept the data file name, do an initial run through the data file to find the maximum number of levels required, and use that to print a header as well as format the lines.
Code:
#!/bin/bash
if [[ -z $1 ]]; then echo "Usage: $0 <data file>"; exit 1; fi
if [[ -f $1 ]]; then data=$1; else echo "No data file found. Exiting."; exit 1; fi

# Determine the maximum number of rows in any group (skipping the header line)
numlevel=$(awk -F ',' '
    NR != 1 { level[$1]++ }
    END {
        for (i in level)
            if (level[i] > numlevel) numlevel = level[i]
        print numlevel
    }' "$data")

# numlevel (and numrm) is the max number of each varying field
awk -F ',' -v numlevel="$numlevel" '
BEGIN {
    last=""; numrm=numlevel; i=1; j=1
    # print the header row (using k as the loop variable so the
    # array indices i and j are still 1 for the first data record)
    printf("%s", "HPit")
    for (k=1; k<=numlevel; k++) printf(",Level%d", k)
    printf(",Stat")
    for (k=1; k<=numlevel; k++) printf(",RM%d", k)
    printf(",X,Y\n")
}

# print one output row for the current group, padding short arrays with "*"
function printline() {
    printf("%s", hpit)
    # dump the level array
    for (i=1; i<=numlevel; i++) {
        if (level[i] == "")
            printf(",*")
        else
            printf(",%s", level[i])
    }
    printf(",%s", stat)
    # dump the rm array
    for (i=1; i<=numrm; i++) {
        if (rm[i] == "")
            printf(",*")
        else
            printf(",%s", rm[i])
    }
    printf(",%s,%s", xval, yval)
    printf("\n")
    # clear our 2 arrays
    delete level
    delete rm
}

# when field one changes, dump the line and reset
# (NR > 2 skips both the header and the first data record)
{ if ($1 != last && NR > 2) { printline(); i=1; j=1 } }

# accumulate the current record (skipping the header)
{
    if (NR > 1) {
        hpit = $1
        level[i++] = $2
        stat = $3
        rm[j++] = $4
        xval = $5
        yval = $6
        last = $1
    }
}

# output the last group
END { printline() }' "$data"
PS - I have taken the liberty of changing the logic slightly to avoid printing the first record.