yes, old thread; but it was fresh today; so I read it...
1: bash does indeed spawn subshells when a pipe is involved; ksh does not; so I often fall back to ksh for such things.
2: declaring variables is not really required; it can sometimes be handy though.
amcohen's script can therefore be altered to:
Code:
#!/bin/ksh
while read line
do
((count+=1))
done < filename
echo "The line count is: $count"
This should work for bash as well, as in this case a pipe is not involved.
In bash with pipe you might be able to do it like this:
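One possibility, sketched here with the same counting example, is to group the loop and the final echo so that everything which needs $count runs inside the pipe's subshell:
Code:
#!/bin/bash
# the pipe puts the whole { ... } group in one subshell, and the loop
# and the echo are both inside that group, so $count is still in scope
cat filename | {
    count=0
    while read line
    do
        ((count+=1))
    done
    echo "The line count is: $count"
}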
Granted, what follows is processing a large text file (~3 million lines), but given the difference between the two methods I will almost never use while read line.
While Read Line - Processing a single file containing ~3 million lines took over 50 hours (I ended up giving up at that point).
Awk - Processing the same file: 90 seconds, including a bit of extra sed processing tacked onto the end.
Granted on tiny tiny files you will notice almost no improvement but on larger text files the difference is huge.
Quote:
While Read Line - Processing a single file containing ~3 million lines took over 50 hours (I ended up giving up at that point).
Awk - Processing the same file: 90 seconds, including a bit of extra sed processing tacked onto the end.
These two are functionally different. If you performed the same process in both cases you would have had to invoke awk 3 million times, unless the awk functionality you used was trivial or you emulated awk with shell code. There is also a very limited set of things you can do with text in awk, compared to what you have access to from the shell (e.g. awk and other things). while read line doesn't cause line reading to be 2000x slower, and it certainly isn't used to manipulate text. Maybe you accidentally read from the terminal and it took you 50h to Ctrl+C out of it.
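To make that concrete, here is a sketch of the two shapes being compared (the per-line field extraction and the file name are just made-up stand-ins for the real work):
Code:
# forking awk from inside the loop: one awk process per line,
# i.e. 3 million forks for a 3-million-line file
while read -r line
do
    echo "$line" | awk '{print $2}'
done < bigfile

# the functional equivalent as a single pass: one awk process reads every line itself
awk '{print $2}' bigfile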
Kevin Barry
PS I don't even think I was using Linux when this thread was started...
AIUI, ksh does fork off subshells for piped commands if there's more than one. It just runs the last command in the chain in the current environment. So only environmental changes done in the first and last commands will be available outside the chain.
Bash 4.2 has also finally implemented the same feature, by the way, when you enable the new lastpipe shell option.
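A minimal sketch of how that option is used, assuming a non-interactive script (where job control is off, so lastpipe can take effect):
Code:
#!/bin/bash
shopt -s lastpipe      # bash 4.2+: run the last command of a pipeline in the current shell
count=0
cat filename | while read line
do
    ((count+=1))
done
echo "The line count is: $count"   # the count survives the loop instead of printing 0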
@R.Hicks
It's not nice to go around calling something "evil" just because you didn't properly understand its strengths and weaknesses, and tried to use it to do something it wasn't really designed for.
Loops are simply a kind of flow control. They're used to execute a command or group of commands sequentially on a series of entries, or until a defined condition is reached. So of course using them to manipulate millions of lines in a file is going to take hours to process.
Where while+read is most useful is when processing lists of filenames, the output of other commands, and other similar situations where the number of iterations can be reasonably defined. Its use as a text manipulation tool is a secondary feature at best, and even then it's most efficient when the manipulations can be done entirely with built-in shell features like parameter substitutions.
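For example, a typical use of that kind, sketched here with a made-up directory and per-file action, is looping over filenames produced by find:
Code:
#!/bin/bash
# -print0 together with read -d '' keeps filenames containing spaces or newlines intact
find /var/log -type f -name '*.log' -print0 |
while IFS= read -r -d '' file
do
    printf 'compressing %s\n' "$file"
    gzip -- "$file"    # hypothetical per-file action
done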
Quote:
These two are functionally different. If you performed the same process in both cases you would have had to invoke awk 3 million times, unless the awk functionality you used was trivial or you emulated awk with shell code. There is also a very limited set of things you can do with text in awk, compared to what you have access to from the shell (e.g. awk and other things). while read line doesn't cause line reading to be 2000x slower, and it certainly isn't used to manipulate text. Maybe you accidentally read from the terminal and it took you 50h to Ctrl+C out of it.
Kevin Barry
PS I don't even think I was using Linux when this thread was started...
Nice @ the reading the terminal line. Nope, not the case.
The script I originally used to process the 3 million lines worked flawlessly (taking ~60 seconds or so) for smaller numbers of lines (20,000 or so). The 3 million lines came about because the server was restarted unexpectedly and this script wasn't included as part of a cron job, so a backlog occurred.
Quote:
@R.Hicks
It's not nice to go around calling something "evil" just because you didn't properly understand its strengths and weaknesses, and tried to use it to do something it wasn't really designed for.
While Read Line is not meant for parsing large text files, which is the point I was illustrating above. You seem to agree with me on this.
Quote:
Loops are simply a kind of flow control. They're used to execute a command or group of commands sequentially on a series of entries, or until a defined condition is reached. So of course using them to manipulate millions of lines in a file is going to take hours to process.
Where while+read is most useful is when processing lists of filenames,
It's still a list. Depending on the length and processing you're doing, while read line is not useful at all.
Quote:
the output of other commands, and other similar situations where the number of iterations can be reasonably defined. Its use as a text manipulation tool is a secondary feature at best, and even then it's most efficient when the manipulations can be done entirely with built-in shell features like parameter substitutions.
The manipulations were pretty much all done with built-ins, using bash's own string manipulation (${var:3:9} etc.); granted there were quite a few of them, however replacing the entire script with AWK, as I say, reduced the processing time for large text files from hours to seconds.
I stand by my initial comments. Processing large text files with while read line is EVIL, and my experience with them has meant that whenever the size of a text file is not known beforehand, I will prefer to do any string manipulation I require in bulk, using AWK.
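As an illustration of the kind of rewrite being described (the offsets and the file name are made up), extracting a fixed-width field from every line looks like this in each approach:
Code:
# bash built-in substring expansion, one read per line
while IFS= read -r line
do
    echo "${line:3:9}"    # 9 characters starting at offset 3 (0-based)
done < bigfile

# the same extraction as a single awk pass over the whole file
awk '{ print substr($0, 4, 9) }' bigfile    # substr() is 1-based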
It's got nothing to do with 'while read line'.
The point is that bash (or any shell) interprets each command as it goes; awk (& Perl) effectively compile the script up front and run the per-line work internally, so they are much faster at this.
The first program I wrote in Perl at work was indeed to do some text manipulation on large text files.
The (ksh) solution worked, but was very slow.
Even just doing an almost line-by-line translation to Perl dropped it from, IIRC, several tens of minutes to a few seconds (this was a long time ago, ~10 yrs; memory a bit vague on the fine detail).
Note that to get the max speedup you'd want to do the entire thing in awk or Perl, not start invoking them just to do the fiddly bits.
Quote:
AIUI, ksh does fork off subshells for piped commands if there's more than one. It just runs the last command in the chain in the current environment. So only environmental changes done in the first and last commands will be available outside the chain.
Bash 4.2 has also finally implemented the same feature, by the way, when you enable the new lastpipe shell option.
Nice to learn a few new things:
0) AIUI is a new abbreviation for me ;-)
1) I didn't know about the limitation in ksh regarding the first and last command in the chain; good to know!
2) I didn't know either that bash, since 4.2, has found a way around this issue. Does this also apply only to the first and last commands in the chain?
Nice constructive comments. While I agree that processing a full 3 million-line text file is not really a shell task, and I would indeed resort to solutions other than a full loop in a shell, I guess I'd write a C program for it instead. (I'm not very fluent with awk and Perl, I fear.)
I personally use bash, and/or Python, and/or R (statistics). If the 3 million lines of text can be processed independently, then I would use the split command to break the file into several parts, possibly into numbered folders (depending on how many processors you have), and then launch all the shells at once in parallel and combine the results at the end. A lot of the time (such as with scientific/genome data) each file requires a header, but that's easy enough to handle. Then have a shell script which fires off all the processes, as sketched below.
By doing this I have cut 5 hrs of processing down to 10 minutes on a 64-processor machine. That's outside the scope of the OP, but it's why I made this recommendation in your case.
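A rough sketch of that layout (the file names, the chunk count and the worker script are all made up):
Code:
#!/bin/bash
# GNU split: cut the big file into 64 line-aligned chunks named part_aa, part_ab, ...
split -n l/64 bigfile.txt part_

# one background worker per chunk, then wait for all of them to finish
for part in part_*
do
    ./process_one.sh "$part" > "$part.out" &
done
wait

# stitch the per-chunk results back together
cat part_*.out > combined.out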
@Ramurd: If you know C, you'll pick up Perl easily enough. It's a lot like C but easier.
I did about 8-9 yrs in C, then switched to Perl.
Runtime is a bit slower (~80-90% as fast as C), but programming is much faster and it's easier to make the result resilient, as Perl strings, arrays, buffers etc. are all handled for you, i.e. no 'writing off the end' by accident.
I also think Perl's references are easier to follow than C's pointer notation.
C is a good language, but you don't need to go down to that level of detail for a lot of programming.
You can do it in more than one way, using loops, exec, awk and redirection. The choice depends on the requirement.
Check the link below for different methods; run-time statistics are also given.
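For the plain line-counting example in this thread, for instance, any of these gives the same number:
Code:
wc -l < myfile                  # redirection plus an external tool
awk 'END { print NR }' myfile   # a single awk pass
grep -c '' myfile               # count the lines matching the empty pattern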
I am using the following code to read line by line from a file; however, my problem is that $value is 0 at the end of the loop, possibly because bash has created a subshell for the loop or something similar. How can I solve this?
Code:
value=0;
while read line
do
value=`expr $value + 1`;
echo $value;
done < "myfile"
echo $value;
Note: This example just counts the number of lines, I actually desire to do more complex processing than this though, so 'wc' is not an alternative, nor is perl im afraid.
Thanks Darren.
BASH has math operations built in, and the `expr $value + 1` or $(($value+1)) forms are not needed.
try this:
Code:
let counter=0
echo $counter
let counter+=5
echo $counter
let counter+=10
echo $counter
let counter=counter/5
echo $counter
let counter=counter*6+3
let delta=13
let counter=counter/7+delta
echo $counter
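For reference, that sequence prints 0, 5, 15, 3 and finally 16 (counter becomes 21 after the *6+3 step, then 21/7 + 13).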
I don't want to be mean, but run this script and see the difference in speed:
Code:
#!/bin/bash
bylet()
{
val=0
iter=1000000
for((i=0;i<$iter;i++))
do
let val+=${i}
done
echo ${val}
}
other()
{
val=0
iter=1000000
for((i=0;i<$iter;i++))
do
((val+=i))
done
echo ${val}
}
time bylet
time other
I got the significant difference of:
./speed.sh
499999500000
real 0m12.808s
user 0m12.540s
sys 0m0.251s
499999500000
real 0m8.609s
user 0m8.363s
sys 0m0.236s
I guess you can safely state that (( )) is way faster than using let
Last edited by Ramurd; 11-04-2011 at 06:20 AM.
Reason: whoops; wrong tags for the code
Quote:
I don't want to be mean, but run this script and see the difference in speed: <snip>
I guess you can safely state that (( )) is way faster than using let
I modified the code to use "let val+=i" rather than "let val+=${i}"; with let, the "${i}" form is not needed, and that extra expansion accounted for a lot of the time difference. The new times are:
Code:
499999500000
real 0m7.804s
user 0m7.688s
sys 0m0.112s
499999500000
real 0m6.199s
user 0m6.082s
sys 0m0.112s
I then changed the for to be "for i in {0..999999}" in bylet and got times of:
Code:
499999500000
real 0m5.990s
user 0m5.897s
sys 0m0.090s
499999500000
real 0m6.385s
user 0m6.256s
sys 0m0.124s
With both for statements changed the times are:
Code:
499999500000
real 0m5.820s
user 0m5.733s
sys 0m0.083s
499999500000
real 0m4.269s
user 0m4.231s
sys 0m0.036s
From these I would say that "(( ))" has a slight edge, but the number of de-references (using ${i} where it is not needed) really affects the times. Also, the "for i in {0..999999}" form performs much faster than the
Code:
iter=1000000
for((i=0;i<$iter;i++))
This is because on every iteration it has to do a comparison, a de-reference (i.e. $iter), and an increment. By contrast,
Code:
for i in {0..999999}
just walks the list that brace expansion has already produced.
As I said at the start, I agree that "(( ... ))" has a slight edge over the "let" statement, but the difference is not as big as you had shown, because you used "${i}" where only "i" was needed; fixing that narrows the gap greatly. With the loop overhead reduced, the timings better reflect the cost of the arithmetic itself.
Last edited by allanf; 11-05-2011 at 06:17 PM.
Reason: Put in more verbage....