transpose selected groups of rows of columns into multiple columns (arrays) with other columns retained to single row-group
Hello,
I would like to know how to do the following (I have been trying awk and some program functions - datamash in Ubuntu - plus what has been posted here about converting columns to rows and vice versa):
I have a CSV file with N columns, sorted numerically so that the rows fall into groups (e.g. column 1 lists the groups, small to big), and other columns whose values change within each group (an "array"). I am trying to output one line per group, transposing the changing values within a group into columns. I have attached a screen image and CSV (text) files of the input and desired output.
Please note - I have shortened the examples: each group can consist of tens of rows, with a number of columns to be transposed into one row with multiple columns, while all the other columns, which don't change within a group, reduce to a single column in that row.
In these files I have given two columns which need to be expanded (Level, RM), and a maximum of four lines per group. The files I am actually trying to change are a few hundred lines and several columns, with up to ten or so rows per group (it varies group by group) to be transposed into columns.
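For readers who cannot see the attachments, here is a small hypothetical illustration of the shape of the transformation (the attached files are the authoritative examples; these column names and values are made up, with "*" padding short groups):

```text
Input (one row per measurement, groups keyed on column 1):
HPit,Level,Stat,RM,X,Y
1,10,A,0.5,100,200
1,20,A,0.7,100,200
2,15,B,0.4,110,210

Desired output (one row per group, varying columns spread out):
HPit,Level1,Level2,Stat,RM1,RM2,X,Y
1,10,20,A,0.5,0.7,100,200
2,15,*,B,0.4,*,110,210
```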
I am not yet proficient in Linux: I used it very basically several years ago - simple grep, awk and sed.
If anyone can show me how to get the examples working, that would be great - I can then try expanding it from there to the larger files.
Your example _input_csv.txt file has DOS-convention line endings. If you want to process such files using Linux tools, first convert them, e.g. 'fromdos < _input_csv.txt > input.csv'.
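If fromdos (or dos2unix) isn't installed, the same conversion can be done with standard tools; a minimal sketch:

```shell
# strip the carriage return at the end of each DOS-style line
sed 's/\r$//' _input_csv.txt > input.csv

# or equivalently, delete all CR characters
tr -d '\r' < _input_csv.txt > input.csv
```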
It is easy to combine the groups into a single row using awk.
If only it were that simple - the OP seems to want to merge fields (not whole records) based on which value(s) vary.
Ugh.
Looks like one would have to know the maximum number of fields for each varying column (4 for original column 2, 3 for original column 4, ...) - either supply it all as input, or parse the entire file to calculate the maxima, then (re-)process the file.
Bigger Ugh.
It is not hard to delete or move entire columns in a spreadsheet. I am suggesting awk to relieve the tedious concatenation of the group data into one row, which I think is what the OP most wants to avoid.
Many thanks for the posts - will try this weekend with the suggestions and reply back.
FYI - boughtonp, the files are generated as XYZ (CSV) files, inversion output from a Windows-based program, so the output is not already in a database-type program.
Trying to avoid spreadsheets as much as possible.
Last edited by TempleStone4510!; 08-18-2022 at 10:26 AM.
I was curious and tried to get Miller to do this (because I've not actually used it before), but got nowhere with it; just trying to follow the docs was painful.
So I had a look at GNU Datamash (which I've not previously encountered), and it took very little time to come up with this:
That's not the desired output format yet, but it does collect all the necessary data into individual rows.
The next step would be to identify the max values of the "count" columns, rename the "collapse" headers (e.g. to "RM1,RM2,RM3,RM4"), pad the corresponding rows with any necessary blanks (i.e. suffix max-count commas), then output just the desired fields and switch the output delimiter back to commas - all of which is something Awk can do.
Looking at the other stuff Datamash can do, it looks like a pretty cool tool.
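For reference, a minimal sketch of the kind of datamash invocation described above, assuming a header row and a column layout of HPit,Level,Stat,RM,X,Y (the real column names and positions come from the actual file):

```shell
# group on column 1; collapse the varying columns into comma-joined lists,
# count how many values each group produced, and keep the first value of
# the columns that are constant within a group
datamash -t, --header-in groupby 1 \
    collapse 2 count 2 first 3 collapse 4 count 4 first 5 first 6 \
    < input.csv
```

The count fields are what the follow-up awk pass would use to work out how many padded columns each collapsed list needs.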
Quote:
Many thanks for the posts - will try this weekend with the suggestions and reply back. FYI - boughtonp, the files are generated as XYZ (csv) files, inversion output from a windows-based program, so output is not already in a database type program.
It would help if you said what this 'windows-based' program is... there may be some options there.
Quote:
Trying to avoid spreadsheets as much as possible.
If the data is consistent, I'd personally create a MySQL database table, use the built-in CSV importer, then I could output the data in a LOT of different ways. Along with keeping archival data, it gives a ton of flexibility later on, if this is something you have to do often.
Quote:
If the data is consistent, I'd personally create a MySQL database table, use the built-in CSV importer, then I could output the data in a LOT of different ways. Along with keeping archival data, it gives a ton of flexibility later on, if this is something you have to do often.
That's definitely something to consider - even if they don't have/want a MySQL server running, it wouldn't take much to write a simple SQLite import script.
Quote:
That's definitely something to consider - even if they don't have/want a MySQL server running, it wouldn't take much to write a simple SQLite import script.
Agreed; I'd use MySQL, only because I'm more familiar with it, but there is a ton out there on how to import things into a DB. A script to read and output the data wouldn't be difficult, even in bash, and you could use Python, Perl, or any other language as well. It also gives you the ability to use ODBC from a spreadsheet, or any number of web-based tools to visualize the data.
Just a ton more flexibility. However... it depends on the consistency of the input data, how much there is, and how often this needs to be done. If it's a few times for a class project... bite the bullet, spend an hour or so in a spreadsheet and call it done.
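As a sketch of the SQLite route mentioned above (the table name and column names here are made up; sqlite3's .import creates the table from the CSV header row when the table doesn't already exist):

```shell
# import the CSV and emit one row per group, with the varying
# columns joined into comma-separated lists via group_concat
sqlite3 groups.db <<'EOF'
.mode csv
.import input.csv survey
SELECT HPit, group_concat(Level), Stat, group_concat(RM), X, Y
FROM survey
GROUP BY HPit;
EOF
```

Note that SQLite lets you select non-aggregated columns (Stat, X, Y) alongside GROUP BY; it picks a value from an arbitrary row of the group, which is fine here because those columns are constant within a group.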
I think this works. It uses two arrays (level and rm) to store the variable-number fields; I set the max to 4 with numlevel and numrm, but it could be passed in. I just called the data file 'data'; I removed the header by hand, but of course you could easily skip it or, if the app produces it, use it to set the number of fields.
Code:
#!/bin/bash
# numlevel and numrm are the max number of each varying field; they could be passed in as args.
awk -F, '
BEGIN { last=""; numlevel=4; numrm=4; i=1; j=1 }

# print one output row for the current group, padding short arrays with "*"
function printline() {
    printf("%s", hpit)
    # dump the level array
    for (i=1; i<=numlevel; i++) {
        if (level[i] == "")
            printf(",*")
        else
            printf(",%s", level[i])
    }
    printf(",%s", stat)
    # dump the rm array
    for (i=1; i<=numrm; i++) {
        if (rm[i] == "")
            printf(",*")
        else
            printf(",%s", rm[i])
    }
    printf(",%s,%s", xval, yval)
    printf("\n")
    # clear our 2 arrays
    delete level
    delete rm
}

# when field one changes, dump the accumulated line and reset the indices
# (but do not print on the very first record)
{
    if ($1 != last && NR != 1) { printline(); i=1; j=1 }
    hpit = $1
    level[i++] = $2
    stat = $3
    rm[j++] = $4
    xval = $5
    yval = $6
    last = $1
}

# output the last group
END { printline() }' data
Last edited by elgrandeperro; 08-19-2022 at 01:58 PM.
If you are going to script a solution, then you might as well add code to accept the data file name, do an initial run through the data file to find the maximum number of levels required, and use that to print a header as well as format the lines.
Code:
#!/bin/bash
if [[ -z $1 ]]; then echo "Usage: $0 <data file>"; exit 1; fi
if [[ -f $1 ]]; then data=$1; else echo "No data file found. Exiting."; exit 1; fi

# Determine the maximum number of rows in any group (skipping the header line)
numlevel=$(awk -F ',' '
    NR != 1 { level[$1]++ }
    END {
        for (i in level)
            if (level[i] > numlevel) numlevel = level[i]
        print numlevel
    }' "$data")

# numlevel (and numrm) is the max number of each varying field
awk -F ',' -v numlevel="$numlevel" '
BEGIN {
    last=""; numrm=numlevel; i=1; j=1
    # print the header row (using k as the loop variable so the
    # array indices i and j are still 1 for the first data record)
    printf("%s", "HPit")
    for (k=1; k<=numlevel; k++) printf(",Level%d", k)
    printf(",Stat")
    for (k=1; k<=numlevel; k++) printf(",RM%d", k)
    printf(",X,Y\n")
}

# print one output row for the current group, padding short arrays with "*"
function printline() {
    printf("%s", hpit)
    # dump the level array
    for (i=1; i<=numlevel; i++) {
        if (level[i] == "")
            printf(",*")
        else
            printf(",%s", level[i])
    }
    printf(",%s", stat)
    # dump the rm array
    for (i=1; i<=numrm; i++) {
        if (rm[i] == "")
            printf(",*")
        else
            printf(",%s", rm[i])
    }
    printf(",%s,%s", xval, yval)
    printf("\n")
    # clear our 2 arrays
    delete level
    delete rm
}

# when field one changes, dump the line and reset
# (NR > 2 skips both the header and the first data record)
{ if ($1 != last && NR > 2) { printline(); i=1; j=1 } }

# accumulate the current record (skipping the header)
{
    if (NR > 1) {
        hpit = $1
        level[i++] = $2
        stat = $3
        rm[j++] = $4
        xval = $5
        yval = $6
        last = $1
    }
}

# output the last group
END { printline() }' "$data"
PS - I have taken the liberty of changing the logic slightly to avoid printing the first record.