LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 02-16-2009, 05:47 AM   #1
Tleilax
LQ Newbie
 
Registered: Feb 2009
Posts: 5

Rep: Reputation: 0
Question Modifying text files with perl


Hi

I'm a beginner with Perl and thought this would be easy, but it isn't for me.

I need to remove the last four letters from the second line of many entries like this:

>contig3113.0010 - 21 bases upstream
ctttccaccacacacac

>contig3113.0020 - 21 bases upstream
tgaaaaaagtacaacaa

So I need to remove the last four letters of each sequence but keep the rest. I know how to ignore the header line, but all the output I produced is either empty or has only one sequence modified.


Thanks for the help
 
Old 02-16-2009, 05:50 AM   #2
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
What function or method in Perl are you using? substring()? Show your code.
 
Old 02-16-2009, 06:47 AM   #3
Tleilax
LQ Newbie
 
Registered: Feb 2009
Posts: 5

Original Poster
Rep: Reputation: 0
I used
substr($string, 0, - 4);

And to ignore the header
} elsif($line =~ /^>/) {
next;

But I guess the next statement is wrong. Should I assign every header line to a variable instead?

Thanks
 
Old 02-16-2009, 06:52 AM   #4
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 60
I changed your data to make it easier for me to see that I was getting it right. Data:
Code:
>contig3113.0010 - 21 bases upstream
ctttccaccacacacacfour

>contig3113.0020 - 21 bases upstream
tgaaaaaagtacaacaafour

>contig3113.0010 - 21 bases upstream
ctttccaccacacacacfour

>contig3113.0020 - 21 bases upstream
tgaaaaaagtacaacaafour
So what I want to see is everything but the letters "four" at the end of the relevant lines. Here's a script:
Code:
#!/usr/bin/env perl
use strict;
use warnings;

while(<>) {
  if (m/^[[:alpha:]]/) {
    substr( $_, -5 ) = "\n";
  }
  print;
}
That removes the last five characters (f-o-u-r and newline) from any line that begins with a letter of the alphabet (not > and not a space). It then replaces the removed bit with (no surprise) a newline again.
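For illustration, here is a minimal sketch of the two ways substr can be used, applied to a single hard-coded line (the variable names are made up for this example). Called as an ordinary function, substr only returns a shortened copy, so the result has to be stored somewhere; assigned to, it changes the string in place.

```perl
use strict;
use warnings;

# Hypothetical sample line, including its trailing newline.
my $line = "ctttccaccacacacacfour\n";

# substr as an rvalue returns a copy; $line itself is not changed.
# Offset 0 with length -5 leaves off the last five characters ("four\n").
my $copy = substr($line, 0, -5) . "\n";

# substr as an lvalue modifies $line in place: the last five
# characters are replaced by a single newline.
substr($line, -5) = "\n";

print $copy;
print $line;
```

The script above uses the second, in-place form on each matching line.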

To run this, save it as "scriptname", and run it as "perl scriptname input_filename". (Or make it executable and do ./scriptname input_filename.) Sample output:
Code:
telemachus ~ $ perl remove_em input 
>contig3113.0010 - 21 bases upstream
ctttccaccacacacac

>contig3113.0020 - 21 bases upstream
tgaaaaaagtacaacaa

>contig3113.0010 - 21 bases upstream
ctttccaccacacacac

>contig3113.0020 - 21 bases upstream
tgaaaaaagtacaacaa
If your input file has different looking lines where you want to remove the last four characters, you would need a different initial regular expression.
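As a rough sketch of what a different initial test could look like, the hypothetical helper below (is_sequence_line is my name, not something from the script) treats any line that is neither blank nor a > header as a sequence line, instead of requiring a leading letter. Whether that rule fits your data is an assumption you would need to check.

```perl
use strict;
use warnings;

# Hypothetical helper: true for any line that is neither a ">" header
# nor empty/whitespace-only. This is an assumption about the data,
# not something the file format guarantees.
sub is_sequence_line {
    my ($line) = @_;
    return 0 if $line =~ /^>/;       # header line
    return 0 if $line =~ /^\s*$/;    # blank line
    return 1;
}

for my $line (">contig3113.0010 - 21 bases upstream\n",
              "ctttccaccacacacacfour\n",
              "\n") {
    my $out = $line;
    # Guard on length so the lvalue substr never reaches outside the string.
    if (is_sequence_line($line) && length($out) >= 5) {
        substr($out, -5) = "\n";
    }
    print $out;
}
```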

Last edited by Telemachos; 02-16-2009 at 05:17 PM.
 
Old 02-16-2009, 07:41 AM   #5
Tleilax
LQ Newbie
 
Registered: Feb 2009
Posts: 5

Original Poster
Rep: Reputation: 0
Thanks a lot. It's working fine. I'll extend it to save to an output file and use it on a pretty large file. I'm a beginner in bioinformatics and pretty naive about things that are more info than bio.
 
Old 02-16-2009, 07:50 AM   #6
Tleilax
LQ Newbie
 
Registered: Feb 2009
Posts: 5

Original Poster
Rep: Reputation: 0
Bothering you again...
Now I've noticed that it's removing four letters only from the last sequence; from the others it's removing just the last two letters.
 
Old 02-16-2009, 09:20 AM   #7
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 60
Can you post the file you're working on - or at least a chunk of it? If it's skipping lines, that probably has to do with the regular expression. As I mentioned above, my version will only work on data that looks relevantly like the data you posted originally. That had three types of lines: 1) blanks, 2) lines starting with > and 3) lines of letters from the alphabet. The regex I wrote should catch all (and only) type 3. (Strictly speaking, the regex is testing for a line that begins with an alphabetic character.) At least, in the sample tests I ran here that's what happened. (See my sample above. It caught all four of the type 3 lines, not just the last one.)

On the other hand, I'm not sure what would make it chop off only two letters. Edit: here's one thought. If you're on Windows or the file is coming from a word processor, I suppose that the line ends may have other non-printing characters which are messing things up. Again, seeing the data and knowing more about the files would help.
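If line endings are the culprit, one way to sidestep them is to strip the ending first and only then count characters. The following is a sketch under that assumption; trim_tail is a hypothetical helper name, and the sample lines are hard-coded rather than read from a file.

```perl
use strict;
use warnings;

# Hypothetical helper: strip any trailing CR/LF first, then drop the
# last $n visible characters, so the result does not depend on
# whether the file uses Unix (\n) or DOS (\r\n) line endings.
sub trim_tail {
    my ($line, $n) = @_;
    $line =~ s/[\r\n]+\z//;                  # remove the line ending
    return $line if length($line) < $n;      # too short: leave it alone
    return substr($line, 0, length($line) - $n);
}

for my $line ("tgaaaaaagtacaacaa\n", "tgaaaaaagtacaacaa\r\n") {
    print trim_tail($line, 4), "\n";         # same output for both
}
```

Because the newline is added back explicitly when printing, the count of removed characters is always exactly four letters.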

Last edited by Telemachos; 02-16-2009 at 10:36 AM. Reason: Added a clarification
 
Old 02-17-2009, 07:49 AM   #8
Tleilax
LQ Newbie
 
Registered: Feb 2009
Posts: 5

Original Poster
Rep: Reputation: 0
I ran the script on Ubuntu Linux or on Cygwin. The input is a file created by the Artemis software; I believe it is in raw format with a .dna extension. I can open it in the Notepad++ editor, which I think creates proper text files that usually work fine. The longest file created looks like this:
>contig3113.0010 - 21 bases upstream
ctttccaccacacacac
>contig3113.0020 - 21 bases upstream
tgaaaaaagtacaacaa
>contig3113.0030 - 21 bases upstream
tataagtacattggcac
>contig00011.0010 - 21 bases upstream
aagctgttttcacggag
>contig00011.0020 - 21 bases upstream
tgaaagccgaggactct
>contig00011.0025 - 21 bases upstream
acttgcatgagggggaa
>contig00011.0030 - 21 bases upstream
gaaacggataaaacagg
>contig00011.0040 - 21 bases upstream
ggaacagactaggagca

But with 2000 sequences.
I should never have stopped reading "Learning Perl" and skipped to BioPerl...
Thanks
 
Old 02-17-2009, 01:54 PM   #9
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 60
Well, I'm not sure what's going on at your end, but I can't reproduce your problem. When I run the script I suggested here (I'm on Debian Linux), the last four characters get clipped from all lines that don't begin with '>'. I can only guess that your file has some non-printing characters that are messing things up. (I ran an input file through unix2dos, and sure enough the dos version has one more character left than the unix one. But only one more character, which doesn't sync with what you described.)

Sorry, but I'm out of ideas. You might ask around your lab to see if anyone knows about the formatting of files produced by that Artemis program.
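One way to check for the non-printing characters suspected above is to print each line with anything outside printable ASCII turned into a visible escape. This is only a diagnostic sketch; show_hidden is a hypothetical helper name and the sample string is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character outside printable
# ASCII (space through tilde) with a \x{..} escape so that carriage
# returns, tabs and similar characters become visible.
sub show_hidden {
    my ($line) = @_;
    $line =~ s/([^\x20-\x7e])/sprintf('\\x{%02x}', ord $1)/ge;
    return $line;
}

# A DOS-style line shows both the carriage return and the newline.
print show_hidden("ctttccaccacacacac\r\n"), "\n";
```

Running the real input through something like this should show whether the lines end in a plain newline, a carriage return plus newline, or something else.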
 
  

