Modifying text files with Perl

Tleilax · 02-16-2009, 05:47 AM

Hi

I'm a beginner with Perl and thought this would be easy, but it isn't for me.

I need to remove the last four letters of the second line of many records like this:

>contig3113.0010 - 21 bases upstream
ctttccaccacacacac

>contig3113.0020 - 21 bases upstream
tgaaaaaagtacaacaa

So I need to remove the last four letters of the sequence, but keep the rest. I know how to ignore the header line, but all the output I got was either empty or had only one sequence modified.

Thanks for the help

ghostdog74 · 02-16-2009, 05:50 AM

What function or method in Perl are you using? substr()? Show your code.

Tleilax · 02-16-2009, 06:47 AM

I used
substr($string, 0, - 4);

And to ignore the header
} elsif($line =~ /^>/) {
next;

But I guess the next statement is wrong. Should I make a variable assignment for every header line?

Thanks
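
A note on the fragment above: with three arguments, substr returns the selected part of the string and leaves $string itself unchanged, so the result has to be stored or printed. A minimal sketch, assuming a plain scalar holding one sequence ($trimmed is a made-up name):

```perl
use strict;
use warnings;

my $string = "ctttccaccacacacac";

# substr() returns everything except the last four characters;
# $string itself is not modified by this call.
my $trimmed = substr($string, 0, -4);

print "$trimmed\n";
```

If the return value is thrown away, as in a bare substr($string, 0, -4); statement, nothing visible happens.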

Telemachos · 02-16-2009, 06:52 AM

I changed your data to make it easier for me to see that I was getting it right. Data:

Code:

>contig3113.0010 - 21 bases upstream
ctttccaccacacacacfour

>contig3113.0020 - 21 bases upstream
tgaaaaaagtacaacaafour

>contig3113.0010 - 21 bases upstream
ctttccaccacacacacfour

>contig3113.0020 - 21 bases upstream
tgaaaaaagtacaacaafour

So what I want to see is everything but the letters "four" at the end of the relevant lines. Here's a script:

Code:

#!/usr/bin/env perl
use strict;
use warnings;

while(<>) {
  if (m/^[[:alpha:]]/) {
    substr( $_, -5 ) = "\n";
  }
  print;
}

That removes the last five characters (f-o-u-r and newline) from any line that begins with a letter of the alphabet (not > and not a space). It then replaces the removed bit with (no surprise) a newline again.
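
The same idea can be written so that the newline is handled separately from the four letters. This is only a sketch of an alternative, not the script above: clip_tail is a hypothetical helper, and it uses chomp plus the four-argument form of substr, which replaces the selected part in place.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the last $n characters of one line.
sub clip_tail {
    my ($line, $n) = @_;
    chomp $line;                                         # remove a trailing newline, if any
    substr($line, -$n, $n, '') if length($line) >= $n;   # cut the last $n characters
    return "$line\n";                                    # put the newline back
}

my $out = clip_tail("ctttccaccacacacacfour\n", 4);
print $out;
```

The length check keeps substr from being called on a line shorter than $n, which would otherwise be an error.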

To run this, save it as "scriptname", and run it as "perl scriptname input_filename". (Or make it executable and do ./scriptname input_filename.) Sample output:

Code:

telemachus ~ $ perl remove_em input 
>contig3113.0010 - 21 bases upstream
ctttccaccacacacac

>contig3113.0020 - 21 bases upstream
tgaaaaaagtacaacaa

>contig3113.0010 - 21 bases upstream
ctttccaccacacacac

>contig3113.0020 - 21 bases upstream
tgaaaaaagtacaacaa

If your input file has different looking lines where you want to remove the last four characters, you would need a different initial regular expression.
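
As a rough illustration of what a different initial test might look like: instead of matching lines that start with a letter, one could treat every non-blank line that does not start with > as a sequence line. This is only a sketch; is_sequence_line is a made-up helper name, and whether this rule fits depends on the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: true for any non-blank line that is not a ">" header.
sub is_sequence_line {
    my ($line) = @_;
    return 0 if $line =~ /^>/;     # header line
    return 0 if $line !~ /\S/;     # blank or whitespace-only line
    return 1;
}

for my $line (">contig3113.0010 - 21 bases upstream\n", "ctttccaccacacacac\n", "\n") {
    my $kind = is_sequence_line($line) ? "sequence: " : "other:    ";
    print $kind, $line;
}
```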

Tleilax · 02-16-2009, 07:41 AM

Thanks a lot. It's working fine. I'll improve it to save to an output file and use it on a pretty large file. I'm a beginner in bioinformatics and pretty naive about things that are more "info" than "bio".

Tleilax · 02-16-2009, 07:50 AM

Bothering you again...
Now I've noticed that it's taking out four letters only on the last sequence; on the others it's taking out just the last two letters.

Telemachos · 02-16-2009, 09:20 AM

Can you post the file you're working on - or at least a chunk of it? If it's skipping lines, that probably has to do with the regular expression. As I mentioned above, my version will only work on data that looks relevantly like the data you posted originally. That had three types of lines: 1) blanks, 2) lines starting with > and 3) lines of letters from the alphabet. The regex I wrote should catch all (and only) type 3. (Strictly speaking, the regex is testing for a line that begins with an alphabetic character.) At least, in the sample tests I ran here that's what happened. (See my sample above. It caught all four of the type 3 lines, not just the last one.)

On the other hand, I'm not sure what would make it chop off only two letters. Edit: here's one thought. If you're on Windows or the file is coming from a word processor, I suppose that the line ends may have other non-printing characters which are messing things up. Again, seeing the data and knowing more about the files would help.
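
One way to check for such non-printing characters is to print the character codes of whatever whitespace ends each line. This is a sketch only; trailing_codes is a hypothetical helper, and the sample strings are invented to show a Unix and a DOS line ending.

```perl
use strict;
use warnings;

# Hypothetical helper: list the trailing whitespace of a line as hex codes.
sub trailing_codes {
    my ($line) = @_;
    my ($tail) = $line =~ /(\s*)\z/;    # all whitespace at the very end
    return join ' ', map { sprintf('0x%02X', ord($_)) } split //, $tail;
}

for my $line ("acgt\n", "acgt\r\n") {
    my $codes = trailing_codes($line);
    print "$codes\n";
}
```

Here 0x0A is a line feed, 0x0D a carriage return and 0x20 a space, so anything besides a single 0x0A at the end of a line would shift what substr( $_, -5 ) removes.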

Tleilax · 02-17-2009, 07:49 AM

I ran the script on Ubuntu Linux and on Cygwin. The input is a file created in the software Artemis; I believe it is in raw format with a .dna extension. I can open it in the editor Notepad++, which I think creates proper text files that usually work fine. The longest file created goes like this:
>contig3113.0010 - 21 bases upstream
ctttccaccacacacac
>contig3113.0020 - 21 bases upstream
tgaaaaaagtacaacaa
>contig3113.0030 - 21 bases upstream
tataagtacattggcac
>contig00011.0010 - 21 bases upstream
aagctgttttcacggag
>contig00011.0020 - 21 bases upstream
tgaaagccgaggactct
>contig00011.0025 - 21 bases upstream
acttgcatgagggggaa
>contig00011.0030 - 21 bases upstream
gaaacggataaaacagg
>contig00011.0040 - 21 bases upstream
ggaacagactaggagca

But with 2000 sequences.
I should never have stopped reading "Learning Perl" and skipped to BioPerl...
Thanks

Telemachos · 02-17-2009, 01:54 PM

Well, I'm not sure what's going on at your end, but I can't reproduce your problem. When I run the script I suggested here (I'm on Debian Linux), the last four characters get clipped from all sequence lines, that is, the lines that don't begin with '>'. I can only guess that your file has some non-printing characters that are messing things up. (I ran an input file through unix2dos, and sure enough the DOS version has one more character left than the Unix one. But only one more character, which doesn't match what you described.)
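
If stray characters at the end of the lines are the cause (which is only a guess), one defensive option is to strip all trailing whitespace before clipping, so the result no longer depends on how each line ends. A minimal sketch under that assumption; clip_sequence is a hypothetical helper and the sample lines are invented:

```perl
use strict;
use warnings;

# Hypothetical helper: remove trailing whitespace (\r, \n, spaces, tabs),
# then drop the last four characters of a sequence line.
sub clip_sequence {
    my ($line) = @_;
    $line =~ s/\s+\z//;                             # normalise the line ending
    return $line if $line eq '' || $line =~ /^>/;   # blanks and headers stay as they are
    substr($line, -4) = '' if length($line) > 4;    # clip the four letters
    return $line;
}

for my $line (">contig3113.0010 - 21 bases upstream\r\n", "ctttccaccacacacac \r\n", "tgaaaaaagtacaacaa") {
    my $out = clip_sequence($line);
    print "$out\n";
}
```

To run this over a file instead of the sample list, the loop could read lines with while (<>) as in the script earlier in the thread.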

Sorry, but I'm out of ideas. You might ask around your lab to see if anyone knows about the formatting of files produced by that Artemis program.