[SOLVED] deleting columns over multiple files

atjurhs · 09-17-2012, 06:55 PM

Hi guys,

I have a directory full of files, like maybe a hundred or two.

These files are suppppper large in size, and I only need a few of the columns of data. So I'd like to delete a bunch of the files and only keep the data I need.

I've found several ways to delete multiple columns of data on a single file, but I need to do that to lots of files.

Can someone please explain to me a general simple way to take the functionality of an awk statement and loop it over many files?

thanks!

Tabitha

pixellany · 09-17-2012, 07:23 PM

the basic construct is:

Code:

for filename in <list>; do
     <stuff--eg your AWK command(s)>
done

there are a number of ways of implementing <list>.....eg:

a simple wildcard (*)---acts on every file in the current directory
The output of something like ls|grep something
read from a file containing the filenames

the best solution depends on where your files are

atjurhs · 09-17-2012, 07:43 PM

so something like this would remove the 3rd, 16th, 21st, and 22nd columns across hundreds of files?

Code:

for filename in *409.txt:
do
   awk 'BEGIN { FS=" "; OFS=" " } {$3=$16=$21=$22"";gsub(",+",",",$0)}1'
done

How does filename get past into the awk command?

pixellany · 09-17-2012, 09:09 PM

You meant passed, not past......

The AWK syntax is
awk <commands> filename

so just add $filename after the AWK command string

colucix · 09-18-2012, 03:36 AM

Actually the loop is not really necessary, since awk accepts multiple arguments and it's able to process all the files in sequence. The code should be slightly corrected (mainly it misses a equal sign before the empty string) but it is not clear to me what the actual field separator is: blank spaces, commas or both?

Here is my suggestion applied to files whose fields (columns) are separated by blank spaces:

Code:

awk '{ $3=$16=$21=$22=""; gsub(/  +/,OFS); gsub(/^ +| +$/,""); print > ( FILENAME ".processed" ) }' *409.txt

The second gsub statement serves to remove blank spaces at the beginning or at the end of each line, in case you remove the first or the last columns. Furthermore, the output is redirected by awk itself to a file whose name is the same as the original one with the suffix ".processed" added. If you want to rename (overwrite) the original files after having checked the results, you can simply do

Code:

rename ".processed" "" *.processed

Beware that the syntax of the rename command may differ among various operating systems, hence - please - check your man pages to find out the correct one.

A final note: don't try to redirect the output of awk directly to the original input, that is using the internal variable FILENAME alone, because in case of large files the input is buffered and at some point (maybe after a few tenths of lines) it might be completely lost and the resulting output hopelessly truncated.

atjurhs · 09-21-2012, 01:32 PM

Hi guys, well the output wasn't exactly right.

here's what my input file *409.txt looks like, note even though many of the columns don't have data or only partial amounts of data, the file is a csv file.

head1, head2, head3, head4, head5, head6, head7, head8, head9, head10
16987423, 1960, 103, 0, 1333275624, -3.6742382, 0.01, 0, stringA, 56723867
, , , , 67332,5555555, , 0.04, , ,
, , , , ,5555556, 79633 , -68.1 , ,
, , , , ,5555559, , , ,
, , , , ,5555561, , , ,
, , , , ,5555562, , , ,
, , , , ,5555569, , , ,
58767422, 2101, 965, 0, 13333845222, -0.144383, 0.01, 0, stringB, 98673890

and this "block" of data repeat (just different number and strings) till EOF and that's a looooooong ways cause the files are REALLY BIG

after executing:

Code:

awk '{ $1=$2=$3=$4""; gsub(/  +/,OFS); gsub(/^ +| +$/,""); print > ( FILENAME ".processed" ) }' *409.txt

the output file that I got has in it:

no header line
0,0,0,0,1333275624, -3.6742382, 0.01, 0, stringA, 56723867
, , , , 67332,5555555, , 0.04, , ,
, , , , ,5555556, 79633, -68.1 , ,
, , , , ,5555559, , , ,
, , , , ,5555561, , , ,
, , , , ,5555568, , , ,
, , , , ,5555569, , , ,
0,0,0,0,13333845222, -0.144383, 0.01, 0, stringB, 98673890

so it looks like it didn't delete the first four columns of the data. It looks like it just replaced the data at the top and bottom of each "block" of data with zeros, and the number of comas on each row stayed the same.

My main goal here is to reduce the file size by removing columns of data I don't need, so having the zeros messes that up

these files are too large to suck into Excel, but if I could, I would simply highlight the columns I don't want and delete. this is the functionality that I want to have with this script

thanks sooooo much guys for your help!

Tabitha

colucix · 09-21-2012, 04:26 PM

Well.. the posted example clarifies that the fields are comma (not space separated) so that, with a slight modification the suggested code should work as expected. Said that, I'd like to make you notice that you repeated the typo (?) in your line of code, that is:

Code:

$1=$2=$3=$4""

that should be instead:

Code:

$1=$2=$3=$4=""

At this point, it should be clear why you got zeros in place of the first three fields: the fourth field was 0 and the (errouneous) statement assigned the value of $4 (followed by an empty string) to $1, $2 and $3. In the corrected statement, the empty string is assigned to $1, $2, $3 and $4.

The rest of the code serves to adjust the unwanted delimiters (if any). However I still have some doubt about the appearance of the desired output. Please, can you post an example of output based on the sample lines posted above?

Here I copy/paste your input into CODE tags, that reveal (preserve) the true spacing of the file:

Code:

16987423, 1960, 103, 0, 1333275624, -3.6742382, 0.01, 0, stringA, 56723867
        ,     ,    ,  , 67332,5555555,        , 0.04,  ,        ,
        ,     ,    ,  ,     ,5555556, 79633   , -68.1  ,        , 
        ,     ,    ,  ,     ,5555559,         ,        ,        ,
        ,     ,    ,  ,     ,5555561,         ,        ,        ,
        ,     ,    ,  ,     ,5555562,         ,        ,        ,        
        ,     ,    ,  ,     ,5555569,         ,        ,        ,
58767422, 2101, 965, 0, 13333845222, -0.144383, 0.01, 0, stringB, 98673890

Please, do the same to post the desired output, since the correct spacing is very relevant in order to suggest a suitable awk code. Thanks.

atjurhs · 09-21-2012, 05:33 PM

so I'm guessing that the coma based delimeter get stipulated in this part of the script

Code:

gsub(/  +/,OFS)

and that that was the case for space separated?

I'm thinking that

Code:

gsub(/,  +/,OFS)

would be used for coma separated?

Tabby

colucix · 09-22-2012, 02:35 AM

Code:

gsub(/  +/,OFS)

This substitutes every sequence of two or more spaces (that resulted from reducing some fields to empty strings in a file whose field separators are blank spaces) with the actual value of the Output Field Separator.

In a true CSV file, e.g.

Code:

field 1,field 2,field 3,field 4,field 5,field 6,field 7,field 8,field 9

we should want to remove the fields 1,2,4,6,7,9 e.g.

Code:

$ awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; print}' file
,,field 3,,field 5,,,field 8,

from the obtained output we want to change the multiple commas in the interior of the line with a single comma AND remove the remaining commas at the edges. For the first task we do:

Code:

$ awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; gsub(/,,+/,OFS); print}' file
,field 3,field 5,field 8,

and for the second one we add another gsub call:

Code:

$ awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; gsub(/,,+/,OFS); gsub(/^,+|,+$/,""); print}' file
field 3,field 5,field 8

My doubt is that in your file you have some null fields made by blank spaces (even in the 5th column) hence I don't really know what the exact output should be (for this reason I asked). Anyway, the suggested code using comma as separator should give you a clue and maybe bring to a solution near to the desired output. Hope this clarifies a little bit.

David the H. · 09-23-2012, 08:50 AM

Quote:

Originally Posted by colucix

Actually the loop is not really necessary, since awk accepts multiple arguments and it's able to process all the files in sequence.

I think this depends on the total number of files passed. The command line does have a limit on the length of the command it can run, and globbing patterns are expanded before the command is executed. If it matches too many files it could overwhelm the system's ability to handle them.

A safer bet would be to either use find and a while+read loop, with null separators.

Code:

while IFS='' read -r -d '' fname; do

	awk '....' "$fname"

done < <( find . -maxdepth 1 -type f -name '*.409.txt' -print0 )

Actually, perhaps a better option overall would be to set up your awk command as a standalone script, and run it directly with find.

script.awk:

Code:

#!/usr/bin/awk -f
....your awk commands go here...

Make it executable and use it like this:

Code:

find . -maxdepth 1 -type f -name '*.409.txt' -exec script.awk '{}' \+

The + at the end means it will send as many files to the script at once as the system will allow. This makes it as efficient as possible within the limits of the system. Note that the script has to be able to handle the multiple input arguments, though that shouldn't be a problem with awk.

konsolebox · 09-23-2012, 09:44 AM

Awk is quite confusing to test. Anyway in bash you could make it work like this:

Code:

#!/bin/bash

# Change columns here. This is repetitive within the loop, but just for convenience.
REMOVE=(1 2 3 4)

# Extension of output file's name.
OUTPUTEXT='out'

IFS=','

for FILE; do
	while read LINE; do
		read -a FIELDS <<< "${LINE//, /,}"
		for I in "${REMOVE[@]}"; do
			unset "FIELDS[$I]"
		done
		LINE="${FIELDS[*]}"
		echo "${LINE//,/, }"
	done < "$FILE" > "$FILE.$OUTPUTEXT"
done

bash script.sh file1 file2 ...

atjurhs · 09-24-2012, 11:14 AM

colucix, I ran your last script

Code:

awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; gsub(/,,+/,OFS); gsub(/^,+|,+$/,""); print}' file
field 3,field 5,field 8

and it said it could not find the file ???

so I tried

Code:

awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; gsub(/,,+/,OFS); gsub(/^,+|,+$/,""); print > ( FILENAME ".processed" ) }' *409.txt

and the script found the file and ran perfectly BUT the header and 1st row of data were shifted in the output file, see below.

after removing all spaces, and all but one coma, per column so that it is a "true" *.csv file like: field1,field2,field3, etc.

column 1 has the header and first row of data from column 3 then from row 2 till the EOF it has column 1 data
column 2 has the header and first row of data from column 5 then from row 2 till the EOF it has column 2 data
column 3 has the header and first row of data from column 8 then from row 2 till the EOF it has column 3 data
column 4 has the header and first row of data from column 10 then from row 2 till the EOF it has column 4 data
column 5 has the header and first row of data from column 11 then from row 2 till the EOF it has column 5 data
column 6 has the header and first row of data from column 12 then from row 2 till the EOF it has column 6 data

so essentially the script only worked on the first row of data, but it did not delete any of the data columns below the first row.

the rest of the columns have the same pattern of being 6 columns off in the header and first row of data.

atjurhs · 09-24-2012, 02:06 PM

konsolebox

for constency in the data file I am using,I removed the same numbers of column data as colucix did, ie. 1 2 4 6 7 9 here's the results of running your bash script on the file

after removing all spaces, and all but one coma, per column so that it is a "true" *.csv file like: field1,field2,field3, etc.

column 1 is identical to the input file's column 1
column 2 has the header and first row of data from column 4 then from row 2 till the EOF it has column 2 data
column 3 has the header and first row of data from column 6 then from row 2 till the EOF it has column 3 data
column 4 has the header and first row of data from column 9 then from row 2 till the EOF it has column 4 data
column 5 has the header and first row of data from column 11 then from row 2 till the EOF it has column 5 data
column 6 has the header and first row of data from column 12 then from row 2 till the EOF it has column 6 data

so essentially the script only worked on the first row of data, but it was off in index by 1, and it did not delete any of the data columns below the first row.

the rest of the columns have the same pattern of being 6 columns off in the header and first row of data

konsolebox · 09-24-2012, 07:20 PM

@atjurhs: With respect to the results, could you give me a sample of the input and the output?

atjurhs · 09-24-2012, 08:58 PM

all the data both input and output is on a private LAN, so I'll do my best typing to reflect both.

the input data is very well depicted in post number 7

and here's an example of the output data. It that has varied amounts of spaces and comas in between each column that has data. Some of the columns contain no data, just separated by multiple spaces and comas. Other columns do have data that may of may not be separated by multiple spaces or comas.

1, 4242, 3.42323e+23, 0.1, 0, 0,5,294875, 8438393, 394,,,,,,,,
,0, ,0,,,, 0.487564, , ,0, 0,0, 87563,,,,,,,,, , 0 ,
,1, ,,,,,,,,,,,,,,,, 0, , , , 5,
,1, ,,,,,,5241,,,,, , , 0.4543e-3 , 0 , 111111111,
1, 1000,,,, 9576336e+10, 0.1, 0, 0, ,,, , 8438393, 001,

if by hand I delete all the spaces, and replace all multiple occurrences of comas (two or more consecutive) with a single coma, then the data looks like that below. note the 1st and 5th rows are missing an initial coma, or you could say that the 2nd through 4th rows have an extra coma at the front.

rc11,rc12,rc13,rc14,rc15, etc.
,rc21,rc22,rc23,rc24,rc25, etc.
,rc31,rc32,rc33,rc34,rc35, etc.
,rc41,rc42,rc43,rc44,rc45, etc.
rc51,rc52,rc53,rc54,rc55, etc.
etc.
etc.
etc.

I really hope this helps you! and REALLY appreciate you helping me!

Tabitha