yes, old thread; but it was fresh today; so I read it...
1: bash does indeed spawn subshells when a pipe is involved; ksh does not; so I often fall back to ksh for such things.
2: declaring variables is not really required; it can sometimes be handy though.
amcohen's script can therefore be altered to:
Code:
#!/bin/ksh
while read line
do
((count+=1))
done < filename
echo "The line count is: $count"
This should work for bash as well, as in this case a pipe is not involved.
In bash with pipe you might be able to do it like this:
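One possibility, sketched here with the same counting example, is to group the loop and the final echo so that everything which needs $count runs inside the pipe's subshell:
Code:
#!/bin/bash
# the pipe puts the whole { ... } group in one subshell, and the loop
# and the echo are both inside that group, so $count is still in scope
cat filename | {
    count=0
    while read line
    do
        ((count+=1))
    done
    echo "The line count is: $count"
}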
Granted, what follows is processing a large text file (~3 million lines), but given the difference between the two methods I will almost never use while read line.
While Read Line - Processing a single file containing ~3 million lines took over 50 hours (I ended up giving up at that point).
Awk - Processing the same file: 90 seconds, including a bit of extra sed processing tacked onto the end.
Granted on tiny tiny files you will notice almost no improvement but on larger text files the difference is huge.
Quote:
While Read Line - Processing a single file containing ~3 million lines took over 50 hours (I ended up giving up at that point).
Awk - Processing the same file: 90 seconds, including a bit of extra sed processing tacked onto the end.
These two are functionally different. If you performed the same process in both cases you would have had to invoke awk 3 million times, unless the awk functionality you used was trivial or you emulated awk with shell code. There is also a very limited set of things you can do with text in awk, compared to what you have access to from the shell (e.g. awk and other things). while read line doesn't cause line reading to be 2000x slower, and it certainly isn't used to manipulate text. Maybe you accidentally read from the terminal and it took you 50h to Ctrl+C out of it.
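To make that concrete, here is a sketch of the two shapes being compared (the per-line field extraction and the file name are just made-up stand-ins for the real work):
Code:
# forking awk from inside the loop: one awk process per line,
# i.e. 3 million forks for a 3-million-line file
while read -r line
do
    echo "$line" | awk '{print $2}'
done < bigfile

# the functional equivalent as a single pass: one awk process reads every line itself
awk '{print $2}' bigfile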
Kevin Barry
PS I don't even think I was using Linux when this thread was started...
AIUI, ksh does fork off subshells for piped commands if there's more than one. It just runs the last command in the chain in the current environment. So only environmental changes done in the first and last commands will be available outside the chain.
Bash 4.2 has also finally implemented the same feature, by the way, when you enable the new lastpipe shell option.
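A minimal sketch of how that option is used, assuming a non-interactive script (where job control is off, so lastpipe can take effect):
Code:
#!/bin/bash
shopt -s lastpipe      # bash 4.2+: run the last command of a pipeline in the current shell
count=0
cat filename | while read line
do
    ((count+=1))
done
echo "The line count is: $count"   # the count survives the loop instead of printing 0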
@R.Hicks
It's not nice to go around calling something "evil" just because you didn't properly understand its strengths and weaknesses, and tried to use it to do something it wasn't really designed for.
Loops are simply a kind of flow control. They're used to execute a command or group of commands sequentially on a series of entries, or until a defined condition is reached. So of course using them to manipulate millions of lines in a file is going to take hours to process.
Where while+read is most useful is when processing lists of filenames, the output of other commands, and other similar situations where the number of iterations can be reasonably defined. Its use as a text manipulation tool is a secondary feature at best, and even then it's most efficient when the manipulations can be done entirely with built-in shell features like parameter substitutions.
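For example, a typical use of that kind, sketched here with a made-up directory and per-file action, is looping over filenames produced by find:
Code:
#!/bin/bash
# -print0 together with read -d '' keeps filenames containing spaces or newlines intact
find /var/log -type f -name '*.log' -print0 |
while IFS= read -r -d '' file
do
    printf 'compressing %s\n' "$file"
    gzip -- "$file"    # hypothetical per-file action
done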
Quote:
These two are functionally different. If you performed the same process in both cases you would have had to invoke awk 3 million times, unless the awk functionality you used was trivial or you emulated awk with shell code. There is also a very limited set of things you can do with text in awk, compared to what you have access to from the shell (e.g. awk and other things). while read line doesn't cause line reading to be 2000x slower, and it certainly isn't used to manipulate text. Maybe you accidentally read from the terminal and it took you 50h to Ctrl+C out of it.
Kevin Barry
PS I don't even think I was using Linux when this thread was started...
Nice @ the reading the terminal line. Nope, not the case.
The script I originally used to process the 3 million lines worked flawlessly (taking ~60 seconds or so) for smaller numbers of lines (20,000 or so). The 3 million lines came about because the server was restarted unexpectedly and this script wasn't included as part of a cron job, so a backlog occurred.
Quote:
@R.Hicks
It's not nice to go around calling something "evil" just because you didn't properly understand its strengths and weaknesses, and tried to use it to do something it wasn't really designed for.
While Read Line is not meant for parsing large text files, which is the point I was illustrating above. You seem to agree with me on this.
Quote:
Loops are simply a kind of flow control. They're used to execute a command or group of commands sequentially on a series of entries, or until a defined condition is reached. So of course using them to manipulate millions of lines in a file is going to take hours to process.
Where while+read is most useful is when processing lists of filenames,
It's still a list. Depending on the length and processing you're doing, while read line is not useful at all.
Quote:
the output of other commands, and other similar situations where the number of iterations can be reasonably defined. Its use as a text manipulation tool is a secondary feature at best, and even then it's most efficient when the manipulations can be done entirely with built-in shell features like parameter substitutions.
The manipulations were pretty much all done with built-ins, using bash's own string manipulation (${var:3:9} etc.); granted there were quite a few of them, however replacing the entire script with AWK, as I say, reduced the processing time for large text files from hours to seconds.
I stand by my initial comments. Processing large text files with while read line is EVIL, and my experience with them has meant that whenever the size of a text file is not known beforehand, I will prefer to do any string manipulation I require in bulk, using AWK.
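As an illustration of the kind of rewrite being described (the offsets and the file name are made up), extracting a fixed-width field from every line looks like this in each approach:
Code:
# bash built-in substring expansion, one read per line
while IFS= read -r line
do
    echo "${line:3:9}"    # 9 characters starting at offset 3 (0-based)
done < bigfile

# the same extraction as a single awk pass over the whole file
awk '{ print substr($0, 4, 9) }' bigfile    # substr() is 1-based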
It's got nothing to do with 'while read line'.
The point is that bash (or any shell) interprets each command as it goes; awk (& Perl) effectively compile the script up front and run the per-line work internally, so they are much faster at this.
The first program I wrote in Perl at work was indeed to do some text manipulation on large text files.
The (ksh) solution worked, but was very slow.
Even just doing an almost line-by-line translation to Perl dropped it from, IIRC, several tens of minutes to a few seconds (this was a long time ago, ~10 yrs; memory a bit vague on the fine detail).
Note that to get the max speedup you'd want to do the entire thing in awk or Perl, not start invoking them just to do the fiddly bits.
Quote:
AIUI, ksh does fork off subshells for piped commands if there's more than one. It just runs the last command in the chain in the current environment. So only environmental changes done in the first and last commands will be available outside the chain.
Bash 4.2 has also finally implemented the same feature, by the way, when you enable the new lastpipe shell option.
Nice to learn a few new things:
0) AIUI is a new abbreviation for me ;-)
1) I didn't know about the limitation in ksh regarding the first and last command in the chain; good to know!
2) I didn't know either that bash, since 4.2, has found a way around this issue. Does this also apply only to the first and last commands in the chain?
Nice constructive comments. While I agree that processing a full 3 million-line text file is not really a shell task, and I would indeed resort to solutions other than a full loop in a shell, I guess I'd write a C program for it instead. (I'm not very fluent with awk and Perl, I fear.)
I personally use bash, and/or Python, and/or R (statistics). If the 3 million lines of text can be processed independently, then I would use the split command to break the file into several parts, possibly into numbered folders (depending on how many processors you have), and then launch all the shells at once in parallel and combine the results at the end. A lot of the time (such as with scientific/genome data) each file requires a header, but that's easy enough to handle. Then have a shell script which fires off all the processes, as sketched below.
By doing this I have cut 5 hrs of processing down to 10 minutes on a 64-processor machine. That's outside the scope of the OP, but it's why I made this recommendation in your case.
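A rough sketch of that layout (the file names, the chunk count and the worker script are all made up):
Code:
#!/bin/bash
# GNU split: cut the big file into 64 line-aligned chunks named part_aa, part_ab, ...
split -n l/64 bigfile.txt part_

# one background worker per chunk, then wait for all of them to finish
for part in part_*
do
    ./process_one.sh "$part" > "$part.out" &
done
wait

# stitch the per-chunk results back together
cat part_*.out > combined.out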
@Ramurd: If you know C, you'll pick up Perl easily enough. It's a lot like C but easier.
I did about 8-9 yrs in C, then switched to Perl.
Runtime is a bit slower (~80-90% as fast as C), but programming is much faster and it's easier to make the result resilient, as Perl strings, arrays, buffers etc. are all handled for you, i.e. no 'writing off the end' by accident.
I also think Perl's references are easier to follow than C's pointer notation.
C is a good language, but you don't need to go down to that level of detail for a lot of programming.
You can do it in more than one way, using loops, exec, awk and redirection. The choice depends on the requirement.
Check the link below for different methods; run-time statistics are also given.
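For the plain line-counting example in this thread, for instance, any of these gives the same number:
Code:
wc -l < myfile                  # redirection plus an external tool
awk 'END { print NR }' myfile   # a single awk pass
grep -c '' myfile               # count the lines matching the empty pattern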
I am using the following code to read line by line from a file; however, my problem is that $value is 0 at the end of the loop, possibly because bash has created a subshell for the loop or something similar. How can I solve this?
Code:
value=0;
while read line
do
value=`expr $value + 1`;
echo $value;
done < "myfile"
echo $value;
Note: This example just counts the number of lines, I actually desire to do more complex processing than this though, so 'wc' is not an alternative, nor is perl im afraid.
Thanks Darren.
BASH has math operations built in, and the `expr $value + 1` or $(($value+1)) forms are not needed.
try this:
Code:
let counter=0
echo $counter
let counter+=5
echo $counter
let counter+=10
echo $counter
let counter=counter/5
echo $counter
let counter=counter*6+3
let delta=13
let counter=counter/7+delta
echo $counter
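For reference, that sequence prints 0, 5, 15, 3 and finally 16 (counter becomes 21 after the *6+3 step, then 21/7 + 13).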
I don't want to be mean, but run this script and see the difference in speed:
Code:
#!/bin/bash
bylet()
{
val=0
iter=1000000
for((i=0;i<$iter;i++))
do
let val+=${i}
done
echo ${val}
}
other()
{
val=0
iter=1000000
for((i=0;i<$iter;i++))
do
((val+=i))
done
echo ${val}
}
time bylet
time other
I got the significant difference of:
./speed.sh
499999500000
real 0m12.808s
user 0m12.540s
sys 0m0.251s
499999500000
real 0m8.609s
user 0m8.363s
sys 0m0.236s
I guess you can safely state that (( )) is way faster than using let
Last edited by Ramurd; 11-04-2011 at 06:20 AM.
Reason: whoops; wrong tags for the code
Quote:
I don't want to be mean, but run this script and see the difference in speed: <snip>
I guess you can safely state that (( )) is way faster than using let
I modified the code to use "let val+=i" rather than "let val+=${i}"; with let, the "${i}" form is not needed, and that extra expansion accounted for a lot of the time difference. The new times are:
Code:
499999500000
real 0m7.804s
user 0m7.688s
sys 0m0.112s
499999500000
real 0m6.199s
user 0m6.082s
sys 0m0.112s
I then changed the for to be "for i in {0..999999}" in bylet and got times of:
Code:
499999500000
real 0m5.990s
user 0m5.897s
sys 0m0.090s
499999500000
real 0m6.385s
user 0m6.256s
sys 0m0.124s
With both for statements changed the times are:
Code:
499999500000
real 0m5.820s
user 0m5.733s
sys 0m0.083s
499999500000
real 0m4.269s
user 0m4.231s
sys 0m0.036s
From these I would say that "(( ))" has a slight edge, but the number of de-references (using ${i} where it is not needed) really affects the times. Also, the "for i in {0..999999}" form performs much faster than the
Code:
iter=1000000
for((i=0;i<$iter;i++))
This is because on every iteration it has to do a comparison, a de-reference (i.e. $iter), and an increment. By contrast,
Code:
for i in {0..999999}
just walks the list that brace expansion has already produced.
As I said at the start, I agree that "(( ... ))" has a slight edge over the "let" statement, but the difference is not as big as you had shown, because you used "${i}" where only "i" was needed; fixing that narrows the gap greatly. With the loop overhead reduced, the timings better reflect the cost of the arithmetic itself.
Last edited by allanf; 11-05-2011 at 06:17 PM.
Reason: Put in more verbage....