Perl replace text in file

Sergei Steshenko · 04-29-2009, 04:49 AM

Quote:

Originally Posted by raimizou

Thanks all for your help. I now have a functional script (it works at least on the simple "foo bar" example that I posted before), but the performance of this script is quite bad (while loops...). Do you have any idea on how to improve the code pasted below ? Thanks again for your valuable help.

Gilles

Code:

#!/usr/bin/perl -w
use strict;


#-----------------#
#     PREAMBLE    #
#-----------------#


# Configuration variables
my $correctLaTeX = $ARGV[2];
my $verbose = 0; # 0 = false

# Print the value of the command line arguments
if ($verbose){
  my $numArgs = $#ARGV + 1;
  print "You provided $numArgs arguments\n";
  print "Input file is $ARGV[0]\n";
  print "Output file is $ARGV[1]\n";
  print "Modification files is $ARGV[2]\n\n"
}

#-----------------#
# MAIN OPERATIONS #
#-----------------#

# Open input file in read mode
open INPUTFILE, "<", $ARGV[0] or die $!;
# Open output file in write mode
open OUTPUTFILE, ">", $ARGV[1] or die $!;

#$modif = "s/ foo / bar /g";

# Read the input file line by line :
while (my $input_line = <INPUTFILE>) {
  # remove the end of line character
  # Open the list of corrections in read mode
  open CORRECTIONFILE, "<", $correctLaTeX or die $!;	
  # Read the modification file line by line :
  while (my $modif = <CORRECTIONFILE>){
    # Remove the comments	  
    $modif =~ s/#.*$// ;
    if ($modif =~ /^[ 	]*$/) {
      # Nothing to do (empty modification)
    } else {
      if ($verbose){
        print("$input_line") ;
      }
      my $counter = 0;
      # Apply the modification (up to twenty times)
      while ( eval("\$input_line =~ $modif") and $counter < 20 ){
	$counter += 1;
      };
      if ($verbose){
        print("$input_line") ;
      }
    }
  }
  # Write the modified line to the output file
  print OUTPUTFILE $input_line;   
  close CORRECTIONFILE;
}

# Close the input and output files
close INPUTFILE;
close OUTPUTFILE;

Why do you use 'eval'? A string 'eval' compiles and then executes its argument every time it is called, and that is why your code is slow.

Reread a regular expressions tutorial: you do not need 'eval'.

raimizou · 04-30-2009, 01:33 AM

Thanks, Sergei, for your answer. I will reread the perldoc perlretut.

I have tried not to use eval, but did not manage to avoid it yet. I use eval in

Code:

eval("\$input_line =~ $modif")

because the regular expression is inside a variable

Code:

$modif

and is applied on another variable

Code:

$input_line

.

Sergei Steshenko · 04-30-2009, 04:00 AM

But the regular expression can be written as

Code:

s/$match_regex/$replacement_string/

; pay attention to the 'e' switch in regular expressions.

Whenever possible, try to have regular expressions that are known at compile time, and use the 'o' switch for efficiency.
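
As a rough illustration of the s/$match_regex/$replacement_string/ form without 'eval', here is a minimal sketch. It assumes the corrections can be stored as (pattern, replacement) pairs; the @corrections list and the apply_corrections name are made up for this example. It uses qr// to compile each pattern once before the main loop, which perlop describes as the usual way to precompile a pattern (the same page advises that /o is almost never a good idea).

```perl
use strict;
use warnings;

# Hypothetical correction list: [pattern, replacement] pairs.
my @corrections = (
    [ 'foo',  'bar' ],   # replace foo with bar
    [ '\s+$', ''    ],   # strip trailing whitespace
);

# Compile each pattern once, before any input line is processed.
my @compiled;
for my $pair (@corrections) {
    my ($pattern, $replacement) = @$pair;
    push @compiled, [ qr/$pattern/, $replacement ];
}

# Apply every correction to a copy of the line and return the result.
sub apply_corrections {
    my ($line) = @_;
    for my $c (@compiled) {
        my ($re, $replacement) = @$c;
        $line =~ s/$re/$replacement/g;
    }
    return $line;
}

print apply_corrections("a foo and a foo   "), "\n";
```

The replacement here is plain text, so this form does not cover corrections that refer to $1, $2 and so on.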

raimizou · 05-04-2009, 03:52 AM

Thanks for your post Sergei.

I appreciate your solution with

Code:

s/$match_regex/$replacement_string/

Unfortunately, one of the requirements for my software is that the list of corrections that I apply must be reusable, and may include more complex modifications than just a single substitution. Thus, I would like to keep a formulation that does not parse the regexp.

I have reread the regexp tutorial, as well as some web resources about the pre-compilation of the regexp (http://alumnus.caltech.edu/~svhwan/p...gExpLoops.html, http://modperlbook.org/html/6-5-3-Co...pressions.html ...). As I mentioned before, I am a real newbie in Perl, so it was difficult for me to understand the subtleties of pre-compilation.

With my limited understanding of what I read, I guess that I could save a lot of time by compiling every regexp of my modification file a single time. In the current version, a loop reads every line of the text file that must be modified. Then, a second loop compiles and applies every regexp of the modification file to that line. So, for a text file of 1000 lines, I could divide the regexp compilation time by a factor of 1000 with the /o switch. Is that correct?

What can I do for the substitution commands that reuse a captured pattern ($1, $2...)? Will they still work with the /o switch?

I really appreciate your help. Thanks a lot for sharing your knowledge.

Sergei Steshenko · 05-04-2009, 05:07 AM

I have read the http://modperlbook.org/html/6-5-3-Co...pressions.html page you mentioned, and yes, the idea of generating a piece of code in which the regular expression is literal, i.e.

Code:

my $pattern = '^\d+$';
eval q{
    foreach (@list) {
        print if /$pattern/o;
    }
}

is a good one.

AFAIK the $1, $2 ...$N mechanism is orthogonal to 'o' switch, i.e. it should work IMO regardless of /o.
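
One caveat about $1, $2 when the replacement itself lives in a variable: the text of the variable is inserted as-is, so a '$1' stored inside it is not expanded again. The sketch below compares three variants on a made-up 'key=value' string; all variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

my $re   = qr/(\w+)=(\w+)/;
my $text = 'key=value';

# 1. Plain interpolation: the replacement text is inserted as-is,
#    so '$2' and '$1' stay literal.
my $template = '$2=$1';
(my $plain = $text) =~ s/$re/$template/;

# 2. /ee: first evaluates $code to get a string, then evaluates that
#    string as Perl, which interpolates $1 and $2. Only for trusted input.
my $code = '"$2=$1"';
(my $evaluated = $text) =~ s/$re/$code/ee;

# 3. /e with a code reference: the captures are passed as arguments,
#    and no string is evaluated at run time.
my $swap = sub { my ($left, $right) = @_; return "$right=$left" };
(my $called = $text) =~ s/$re/$swap->($1, $2)/e;

print "$plain\n$evaluated\n$called\n";
```

The /ee variant runs whatever code the replacement string contains, so it is only reasonable when the correction file is trusted.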

...

Anyway, aren't you optimizing too early? On the one hand, I'm pretty much aware of the 'o' switch; on the other, I rarely use it - even without it the scripts seem to be fast enough.

...

You probably can change your approach to something like

Code:

my @regexes_and_replacements; # need to fill it: (regex, replacement, regex, replacement, ...)
...
for(;;)
  {
  last unless @regexes_and_replacements;

  my $regex = shift @regexes_and_replacements;
  my $replacement = shift @regexes_and_replacements;

  # qr// compiles the pattern once for each new $regex; /o inside the sub
  # would instead keep the first pattern for all later pairs
  my $compiled_regex = qr/$regex/;

  my $match_and_replace_sub =
  sub
    {
    my ($line_scalar_ref) = @_;
    ${$line_scalar_ref} =~ s/$compiled_regex/$replacement/;
    };

  foreach my $line(@lines)
    {
    $match_and_replace_sub->(\$line);
    # do something with $line after replacement if any
    }
  }

- the idea is that each line is processed a number of times, once per ($regex, $replacement) pair, and each pair is compiled only once.

By the way, this is an example of closures - the $match_and_replace_sub subroutine inherits lexical variables from the outer scope.
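
If the correction file has to keep holding complete substitution commands (such as s/ foo / bar /g) rather than separate regex and replacement strings, one possible compromise is to call 'eval' once per correction instead of once per input line. The following is only a sketch under that assumption; the @rules list stands in for the lines of the correction file, and apply_rules is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical contents of the correction file: one complete
# substitution command per line, with optional '#' comments.
my @rules = (
    's/foo/bar/g   # rename foo',
    '',
    's/(\d+)-(\d+)/$2-$1/',
);

# Turn each rule into a subroutine exactly once.
my @subs;
for my $rule (@rules) {
    # Same comment stripping as the original script; note that it
    # would also cut a pattern that contains a '#'.
    (my $code = $rule) =~ s/#.*$//;
    next if $code =~ /^\s*$/;    # skip empty rules
    my $sub = eval "sub { \$_[0] =~ $code }"
        or die "Cannot compile rule '$code': $@";
    push @subs, $sub;
}

# Apply all compiled rules to a copy of the line and return the result.
sub apply_rules {
    my ($line) = @_;
    $_->($line) for @subs;
    return $line;
}

print apply_rules("foo 12-34 foo"), "\n";
```

With this layout the cost of 'eval' depends on the number of corrections, not on the number of input lines, and captured groups keep working because each rule is compiled as ordinary Perl code. As with any string 'eval', the correction file must be trusted.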

justaddwater71 · 03-14-2010, 03:32 PM

macmoneta's one-liner solution just saved me a day of pain and suffering doing some machine learning. You rock.

vinay.baranwal · 08-10-2010, 02:28 PM

Dear all,
I am pretty new to Perl, but love this language very much.

Perl one-liners are like perfect pieces of code, but I am facing a problem and seek your guidance.

I have a file containing several entries like this: LOC_Os04g58220|13104.t05295

I want to replace the |13104.t05295 fragment with nothing. When I place this in the above-mentioned one-liner, it completes the job, but somehow the | character remains. One more thing worth mentioning is that these numbers vary, so is there any way with one-liners to remove this particular string by using wildcard characters?
I tried, but alas, it didn't work for me.

perl -i.tiny -pe "s/(|************)//ge" Main.txt_yes

Since I am using the Windows platform, " is needed instead of '.

Prompt response will be much appreciated.

Thanks in advance

MTK358 · 08-10-2010, 04:00 PM

It would be much better to start your own thread.

Quote:

Originally Posted by vinay.baranwal

perl -i.tiny -pe "s/(|************)//ge" Main.txt_yes

Asterisk (*) does NOT mean "one of any character", it means "zero or more of the preceding character". Period (.) means "one of any character". Thus ".*" means "zero or more of any character".

So your regex should be "s/|.*//ge".

Quote:

Originally Posted by vinay.baranwal

Since I am using the Windows platform

Since you are using the Windows platform you don't belong in this forum. But there is nothing wrong with you trying Linux, you might like it! It's free, very easy, and you can try it out without risking your Windows installation using a "Live CD", which basically boots a Linux desktop off your CD/DVD drive!

Sergei Steshenko · 08-10-2010, 06:25 PM

Quote:

Originally Posted by MTK358

...
Since you are using the Windows platform you don't belong in this forum.
...

Wrong. This particular forum ("Programming") specifically allows programming questions regardless of OS.

vinay.baranwal · 08-11-2010, 09:43 AM

Thanks all buddy...

But I am sorry to say that the problem still persists.

| is an operator symbol ("OR"), so when I tried using (|.*) it replaced the content of the whole file.

And I need to delete the | symbol from the file along with the number following it.

Would somebody help me out?

Thanks

MTK358 · 08-11-2010, 10:13 AM

Use "\|" for a literal "|" character.

You should really read a good regular expression tutorial.
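
For reference, a small sketch of the escaping on the sample entry from the question. The variable names are made up for the example; \Q...\E is the standard way to quote metacharacters when the separator comes from a variable.

```perl
use strict;
use warnings;

my $entry = 'LOC_Os04g58220|13104.t05295';

# Escaped pipe: matches a literal '|' and everything after it.
(my $escaped = $entry) =~ s/\|.*//;

# Same idea when the separator is held in a variable:
# \Q...\E quotes any regex metacharacters in it.
my $separator = '|';
(my $quoted = $entry) =~ s/\Q$separator\E.*//;

print "$escaped\n$quoted\n";
```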

vinay.baranwal · 08-11-2010, 11:53 AM

Yeah !!!!!! It did work.

Thanks buddy. Yeah sure i am going to read tutorials...

Thanks once again

ghostdog74 · 08-11-2010, 10:50 PM

Quote:

Originally Posted by vinay.baranwal

Perl one-liners are like perfect pieces of code, but I am facing a problem and seek your guidance.

One-liners are good when they are short and simple (to understand), but become messy and hard to read if you make them extremely long.

Quote:

Code:

perl -i.tiny -pe "s/(|************)//ge" Main.txt_yes

Regular expressions are not always the way to go. With your requirement, you can just do string splitting: see perldoc -f split, and then take the first element. Or, with newer Perl, you can use the -F option.
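
A minimal sketch of the split approach on the sample entry from the question (the variable names are made up). The first argument of split is still a pattern, so the pipe needs the same backslash there.

```perl
use strict;
use warnings;

my $entry = 'LOC_Os04g58220|13104.t05295';

# Split on a literal '|' into at most two fields and keep the first one.
my ($id) = split /\|/, $entry, 2;

print "$id\n";
```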