transpose selected groups of rows into multiple columns (arrays), with other columns retained as a single row per group
Hello,
Wish to know how to do the following [have been trying awk and some program functions (datamash in Ubuntu), plus what has been posted elsewhere about converting columns to rows and vice versa]: I have a CSV file with N columns, sorted numerically into groups (e.g. column 1 lists the groups from small to big), and other columns whose values change within a group (an "array"). I am trying to output one line per group, transposing the columns whose values change within a group into extra columns on that line. I have attached a screen image and CSV (text) files of the input and desired output. Please note I have shortened the examples: each group can consist of tens of rows, with a number of columns to be transposed into one row with multiple columns, while all the columns that don't change within a group become just one column on that row. In these files I have given two columns which need to be expanded (Level, RM) and a maximum of four lines per group. The files I'm actually trying to change are a few hundred lines and several columns, with up to 10 or so rows (it varies group by group) to be transposed to columns. I am not yet proficient in Linux: I used it very basically several years ago (simple grep, awk, sed). If anyone can show me how to get the examples working, that would be great - I can then try expanding it from there to the larger files. |
I'd use tr with , as the delimiter. I might also just open it in a fancier spreadsheet program (I'd use OpenOffice) to do it.
|
Your example _input_csv.txt file has DOS convention line endings. If you want to process such files using Linux tools, then first convert them. e.g. 'fromdos < _input_csv.txt > input.csv'.
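If fromdos isn't installed, the same CRLF stripping can be done with standard tools. A minimal sketch (file names here are just placeholders, and the sample file is made up for illustration):

```shell
# Create a small sample file with DOS (CRLF) line endings.
printf 'a,b\r\nc,d\r\n' > in.csv

# Strip the carriage returns - same effect as fromdos.
tr -d '\r' < in.csv > out.csv

# sed works too: sed 's/\r$//' in.csv > out.csv
```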
It is easy to combine the groups into a single row using awk. Code:
awk -F ',' '{if ($1==last) {printf "%s%s",",",$0} else {printf "%s%s","\n",$0};last=$1; next} END {printf "%s", "\n"}' _input.csv > _output.csv |
If only it was that simple - the OP seems to want to merge fields (not whole records) based on value(s) varying.
Ugh. Looks like one would have to provide the maximum number of fields for each column that varies (4 for original column 2, 3 for original column 4, ...) - either provide it all as input, or parse the entire file to calculate them all, then (re-)process the file. Bigger ugh. |
It is not hard to delete or move entire columns in a spreadsheet. I am suggesting awk to relieve the tedious concatenation of the group data into one row, which I think is what the OP most wants to avoid.
|
Where does the source data come from?
If it's a database query which is exported to CSV, the transposing can be done in SQL, and will likely be more efficient. Otherwise, if you're stuck with CSV data, Miller will be able to slice/transpose/join/etc to get the data in the format desired. |
Will try suggestions.
Many thanks for the posts - will try this weekend with the suggestions and reply back.
FYI - boughtonp, the files are generated as XYZ (csv) files, inversion output from a windows-based program, so output is not already in a database type program. Trying to avoid spreadsheets as much as possible. |
I was curious and trying to get Miller to do this (because I've not actually used it before), but got nowhere with it. Just trying to follow the docs was painful.
So I had a look at GNU Datamash (which I've not previously encountered), and it took very little time to come up with this: Code:
$ datamash -t, -H --output-delimiter=$'\t' --full -g 1 count 2,4 collapse 2,4 < _Input_csv.txt
Next step would be to identify the max values of the "count" columns, rename the "collapse" headers (e.g. to "RM1,RM2,RM3,RM4"), add any necessary blanks to the corresponding rows (i.e. suffix max-count commas), then output just the desired fields and switch the output delimiter back to commas - all of which is something Awk can do. Looking at the other stuff Datamash can do, it looks like a pretty cool tool. |
Just a ton more flexibility. However, it depends on the consistency of the input data, how much there is, and how often this needs to be done. If it's a few times for a class project, bite the bullet, spend an hour or so in a spreadsheet, and call it done. |
I think this works. It uses two arrays (level and rm) to store the variable-number fields; I set the max to 4 with numlevel and numrm, but it could be passed in. The data file I just called data; I removed the header by hand, but of course you could easily skip it or, if the app produces it, use it to set the number of fields.
Code:
#!/bin/bash |
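The script body itself appears to have been lost in the copy/paste. A hedged reconstruction of what the post describes (two arrays, fixed max of 4; the input layout HPlt,Level,Stat,RM,X,Y with no header is an assumption inferred from the output sample later in the thread, and the sample data below is invented for illustration):

```shell
#!/bin/bash
# Hedged reconstruction - assumed input layout (no header): HPlt,Level,Stat,RM,X,Y
numlevel=4   # max Level fields per group, as in the post
numrm=4      # max RM fields per group

# Invented sample data, so the sketch is self-contained.
cat > data <<'EOF'
1,10,A,0.1,100,200
1,20,A,0.2,100,200
2,15,B,0.3,300,400
EOF

awk -F, -v nl="$numlevel" -v nrm="$numrm" '
$1 != prev {                         # new group: flush the previous one
    if (NR > 1) emit()
    prev = $1; stat = $3; x = $5; y = $6
    n = 0
}
{ n++; level[n] = $2; rm[n] = $4 }   # store the varying fields
END { emit() }
function emit(   i, line) {
    line = prev
    for (i = 1; i <= nl; i++) line = line "," level[i]
    line = line "," stat
    for (i = 1; i <= nrm; i++) line = line "," rm[i]
    print line "," x "," y
    split("", level); split("", rm)  # clear the arrays for the next group
}' data > out.csv
```

Unused slots print as empty fields, so every output row has the same column count.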
Cheers - I was going to look at this after walking the mutt, but Kelly Slater tearing up Teahupo'o got in the way ... :p
Maybe over the weekend ... |
If you are going to script a solution, then you might as well add code to accept the data file name, do an initial run through the data file to find the maximum number of levels required and use that to print a header as well as format the lines.
Code:
#!/bin/bash |
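This script body also didn't survive the copy/paste. A hedged sketch of the two-pass idea the post describes - take the file name as an argument, scan once for the largest group, then format with a matching header (same assumed HPlt,Level,Stat,RM,X,Y layout as above; the fallback sample data is invented):

```shell
#!/bin/bash
# Hedged sketch of the two-pass approach: pass 1 sizes the output, pass 2 formats.
# Assumed input layout (no header): HPlt,Level,Stat,RM,X,Y
file="${1:-data}"

# Invented sample data, so the sketch is self-contained.
[ -f "$file" ] || cat > "$file" <<'EOF'
1,10,A,0.1,100,200
1,20,A,0.2,100,200
2,15,B,0.3,300,400
EOF

# Pass 1: find the largest group, i.e. how many Level/RM columns are needed.
max=$(awk -F, '{c[$1]++} END {for (k in c) if (c[k] > m) m = c[k]; print m}' "$file")

# Pass 2: print a header sized to $max, then one padded line per group.
awk -F, -v max="$max" '
BEGIN {
    h = "HPlt"
    for (i = 1; i <= max; i++) h = h ",Level" i
    h = h ",Stat"
    for (i = 1; i <= max; i++) h = h ",RM" i
    print h ",X,Y"
}
$1 != prev { if (NR > 1) emit(); prev = $1; stat = $3; x = $5; y = $6; n = 0 }
{ n++; level[n] = $2; rm[n] = $4 }
END { emit() }
function emit(   i, line) {
    line = prev
    for (i = 1; i <= max; i++) line = line "," level[i]
    line = line "," stat
    for (i = 1; i <= max; i++) line = line "," rm[i]
    print line "," x "," y
    split("", level); split("", rm)
}' "$file" > out2.csv
```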
Here's the Datamash+Awk version - haven't bothered making the level/rm field numbers dynamic - easy to plug that in if necessary.
Code:
filename="_Input_csv.txt"
Code:
HPlt,Level1,Level2,Level3,Level4,Stat,RM1,RM2,RM3,RM4,X,Y |
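The Awk padding stage of that pipeline can be exercised even without Datamash installed. The tab-separated shape below is an assumption about what `--full -g 1 count 2,4 collapse 2,4` emits (first row of each group, then the count and collapse results, with collapsed values comma-joined inside one tab field), and the sample rows are invented:

```shell
# Simulated datamash output (an assumption about the --full count/collapse shape).
{
  printf 'HPlt\tLevel\tStat\tRM\tX\tY\tcount(Level)\tcount(RM)\tcollapse(Level)\tcollapse(RM)\n'
  printf '1\t10\tA\t0.1\t100\t200\t2\t2\t10,20\t0.1,0.2\n'
  printf '2\t15\tB\t0.3\t300\t400\t1\t1\t15\t0.3\n'
} > dm.tsv

awk -F'\t' '
NR == 1 { next }                    # drop the datamash header, build our own
{
    rows[NR] = $0                   # buffer rows so max is known before printing
    if ($7 > max) max = $7
    if ($8 > max) max = $8
}
END {
    h = "HPlt"
    for (i = 1; i <= max; i++) h = h ",Level" i
    h = h ",Stat"
    for (i = 1; i <= max; i++) h = h ",RM" i
    print h ",X,Y"
    for (r = 2; r <= NR; r++) {
        split(rows[r], f, "\t")
        nl = split(f[9], lv, ",")   # collapsed Level values
        nm = split(f[10], rv, ",")  # collapsed RM values
        line = f[1]
        for (i = 1; i <= max; i++) line = line "," (i <= nl ? lv[i] : "")
        line = line "," f[3]
        for (i = 1; i <= max; i++) line = line "," (i <= nm ? rv[i] : "")
        print line "," f[5] "," f[6]
    }
}' dm.tsv > final.csv
```

Short groups get suffixed with empty fields, which is the "suffix max-count commas" step the earlier post described.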