transpose selected groups of rows into multiple columns (arrays), with other columns retained as a single row per group
Hello,
Wish to know how to do the following [have been trying awk and some program functions (datamash in Ubuntu), plus what has been posted elsewhere about converting columns to rows and vice versa]: I have a CSV file with N columns, sorted numerically into groups (e.g. column 1 lists the groups from small to big), and other columns whose values change within a group (an "array"). I am trying to output one line per group, transposing the columns whose values change within a group into extra columns on that line. I have attached a screen image and CSV (text) files of the input and desired output. Please note I have shortened the examples: each group can consist of tens of rows, with a number of columns to be transposed into one row with multiple columns, while all the columns that don't change within a group become just one column on that row. In these files I have given two columns which need to be expanded (Level, RM) and a maximum of four lines per group. The files I'm actually trying to change are a few hundred lines and several columns, with up to 10 or so rows (it varies group by group) to be transposed to columns. I am not yet proficient in Linux: I used it very basically several years ago (simple grep, awk, sed). If anyone can show me how to get the examples working, that would be great - I can then try expanding it from there to the larger files. |
I'd use tr with , as the delimiter. I might also just open it in a fancier spreadsheet program (I'd use OpenOffice) to do it.
|
Your example _input_csv.txt file has DOS convention line endings. If you want to process such files using Linux tools, then first convert them. e.g. 'fromdos < _input_csv.txt > input.csv'.
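If fromdos isn't installed, the same CRLF stripping can be done with standard tools. A minimal sketch (file names here are just placeholders, and the sample file is made up for illustration):

```shell
# Create a small sample file with DOS (CRLF) line endings.
printf 'a,b\r\nc,d\r\n' > in.csv

# Strip the carriage returns - same effect as fromdos.
tr -d '\r' < in.csv > out.csv

# sed works too: sed 's/\r$//' in.csv > out.csv
```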
It is easy to combine the groups into a single row using awk. Code:
awk -F ',' '{if ($1==last) {printf "%s%s",",",$0} else {printf "%s%s","\n",$0};last=$1; next} END {printf "%s", "\n"}' _input.csv > _output.csv |
If only it was that simple - the OP seems to want to merge fields (not whole records) based on value(s) varying.
Ugh. Looks like one would have to provide the maximum number of fields for each column that varies (4 for original column 2, 3 for original column 4, ...) - either provide it all as input, or parse the entire file to calculate them all, then (re-)process the file. Bigger ugh. |
It is not hard to delete or move entire columns in a spreadsheet. I am suggesting awk to relieve the tedious concatenation of the group data into one row, which I think is what the OP most wants to avoid.
|
Where does the source data come from?
If it's a database query which is exported to CSV, the transposing can be done in SQL, and will likely be more efficient. Otherwise, if you're stuck with CSV data, Miller will be able to slice/transpose/join/etc to get the data in the format desired. |
Will try suggestions.
Many thanks for the posts - will try this weekend with the suggestions and reply back.
FYI - boughtonp, the files are generated as XYZ (csv) files, inversion output from a windows-based program, so output is not already in a database type program. Trying to avoid spreadsheets as much as possible. |
I was curious and trying to get Miller to do this (because I've not actually used it before), but got nowhere with it. Just trying to follow the docs was painful.
So I had a look at GNU Datamash (which I've not previously encountered), and it took very little time to come up with this: Code:
$ datamash -t, -H --output-delimiter=$'\t' --full -g 1 count 2,4 collapse 2,4 < _Input_csv.txt
Next step would be to identify the max values of the "count" columns, rename the "collapse" headers (e.g. to "RM1,RM2,RM3,RM4"), add any necessary blanks to the corresponding rows (i.e. suffix max-count commas), then output just the desired fields and switch the output delimiter back to commas - all of which is something Awk can do. Looking at the other stuff Datamash can do, it looks like a pretty cool tool. |
Just a ton more flexibility. However, it depends on the consistency of the input data, how much there is, and how often this needs to be done. If it's a few times for a class project, bite the bullet, spend an hour or so in a spreadsheet, and call it done. |
I think this works. It uses two arrays (level and rm) to store the variable-number fields; I set the max to 4 with numlevel and numrm, but it could be passed in. The data file I just called data; I removed the header by hand, but of course you could easily skip it or, if the app produces it, use it to set the number of fields.
Code:
#!/bin/bash |
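The script body itself appears to have been lost in the copy/paste. A hedged reconstruction of what the post describes (two arrays, fixed max of 4; the input layout HPlt,Level,Stat,RM,X,Y with no header is an assumption inferred from the output sample later in the thread, and the sample data below is invented for illustration):

```shell
#!/bin/bash
# Hedged reconstruction - assumed input layout (no header): HPlt,Level,Stat,RM,X,Y
numlevel=4   # max Level fields per group, as in the post
numrm=4      # max RM fields per group

# Invented sample data, so the sketch is self-contained.
cat > data <<'EOF'
1,10,A,0.1,100,200
1,20,A,0.2,100,200
2,15,B,0.3,300,400
EOF

awk -F, -v nl="$numlevel" -v nrm="$numrm" '
$1 != prev {                         # new group: flush the previous one
    if (NR > 1) emit()
    prev = $1; stat = $3; x = $5; y = $6
    n = 0
}
{ n++; level[n] = $2; rm[n] = $4 }   # store the varying fields
END { emit() }
function emit(   i, line) {
    line = prev
    for (i = 1; i <= nl; i++) line = line "," level[i]
    line = line "," stat
    for (i = 1; i <= nrm; i++) line = line "," rm[i]
    print line "," x "," y
    split("", level); split("", rm)  # clear the arrays for the next group
}' data > out.csv
```

Unused slots print as empty fields, so every output row has the same column count.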
Cheers - I was going to look at this after walking the mutt, but Kelly Slater tearing up Teahupo'o got in the way ... :p
Maybe over the weekend ... |
If you are going to script a solution, then you might as well add code to accept the data file name, do an initial run through the data file to find the maximum number of levels required and use that to print a header as well as format the lines.
Code:
#!/bin/bash |
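This script body also didn't survive the copy/paste. A hedged sketch of the two-pass idea the post describes - take the file name as an argument, scan once for the largest group, then format with a matching header (same assumed HPlt,Level,Stat,RM,X,Y layout as above; the fallback sample data is invented):

```shell
#!/bin/bash
# Hedged sketch of the two-pass approach: pass 1 sizes the output, pass 2 formats.
# Assumed input layout (no header): HPlt,Level,Stat,RM,X,Y
file="${1:-data}"

# Invented sample data, so the sketch is self-contained.
[ -f "$file" ] || cat > "$file" <<'EOF'
1,10,A,0.1,100,200
1,20,A,0.2,100,200
2,15,B,0.3,300,400
EOF

# Pass 1: find the largest group, i.e. how many Level/RM columns are needed.
max=$(awk -F, '{c[$1]++} END {for (k in c) if (c[k] > m) m = c[k]; print m}' "$file")

# Pass 2: print a header sized to $max, then one padded line per group.
awk -F, -v max="$max" '
BEGIN {
    h = "HPlt"
    for (i = 1; i <= max; i++) h = h ",Level" i
    h = h ",Stat"
    for (i = 1; i <= max; i++) h = h ",RM" i
    print h ",X,Y"
}
$1 != prev { if (NR > 1) emit(); prev = $1; stat = $3; x = $5; y = $6; n = 0 }
{ n++; level[n] = $2; rm[n] = $4 }
END { emit() }
function emit(   i, line) {
    line = prev
    for (i = 1; i <= max; i++) line = line "," level[i]
    line = line "," stat
    for (i = 1; i <= max; i++) line = line "," rm[i]
    print line "," x "," y
    split("", level); split("", rm)
}' "$file" > out2.csv
```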
Here's the Datamash+Awk version - haven't bothered making the level/rm field numbers dynamic - easy to plug that in if necessary.
Code:
filename="_Input_csv.txt"
Code:
HPlt,Level1,Level2,Level3,Level4,Stat,RM1,RM2,RM3,RM4,X,Y |
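The Awk padding stage of that pipeline can be exercised even without Datamash installed. The tab-separated shape below is an assumption about what `--full -g 1 count 2,4 collapse 2,4` emits (first row of each group, then the count and collapse results, with collapsed values comma-joined inside one tab field), and the sample rows are invented:

```shell
# Simulated datamash output (an assumption about the --full count/collapse shape).
{
  printf 'HPlt\tLevel\tStat\tRM\tX\tY\tcount(Level)\tcount(RM)\tcollapse(Level)\tcollapse(RM)\n'
  printf '1\t10\tA\t0.1\t100\t200\t2\t2\t10,20\t0.1,0.2\n'
  printf '2\t15\tB\t0.3\t300\t400\t1\t1\t15\t0.3\n'
} > dm.tsv

awk -F'\t' '
NR == 1 { next }                    # drop the datamash header, build our own
{
    rows[NR] = $0                   # buffer rows so max is known before printing
    if ($7 > max) max = $7
    if ($8 > max) max = $8
}
END {
    h = "HPlt"
    for (i = 1; i <= max; i++) h = h ",Level" i
    h = h ",Stat"
    for (i = 1; i <= max; i++) h = h ",RM" i
    print h ",X,Y"
    for (r = 2; r <= NR; r++) {
        split(rows[r], f, "\t")
        nl = split(f[9], lv, ",")   # collapsed Level values
        nm = split(f[10], rv, ",")  # collapsed RM values
        line = f[1]
        for (i = 1; i <= max; i++) line = line "," (i <= nl ? lv[i] : "")
        line = line "," f[3]
        for (i = 1; i <= max; i++) line = line "," (i <= nm ? rv[i] : "")
        print line "," f[5] "," f[6]
    }
}' dm.tsv > final.csv
```

Short groups get suffixed with empty fields, which is the "suffix max-count commas" step the earlier post described.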