So I need to remove the last four letters of each sequence, but keep the rest. I know how to ignore the header line, but every output I produce is either empty or has only one sequence modified.
So what I want to see is everything but the letters "four" at the end of the relevant lines. Here's a script:
Code:
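#!/usr/bin/env perl
use strict;
use warnings;

# Read every line from the files named on the command line (or STDIN).
while (<>) {
    # Only touch lines that begin with a letter (not '>' headers or blanks).
    if (m/^[[:alpha:]]/) {
        # Replace the last five characters (four letters + newline) with a newline.
        substr( $_, -5 ) = "\n";
    }
    print;
}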
That removes the last five characters (f-o-u-r and newline) from any line that begins with a letter of the alphabet (not > and not a space). It then replaces the removed bit with (no surprise) a newline again.
To run this, save it as "scriptname", and run it as "perl scriptname input_filename". (Or make it executable and do ./scriptname input_filename.) Sample output:
If your input file has different looking lines where you want to remove the last four characters, you would need a different initial regular expression.
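For instance, here is a rough sketch of one such variant, assuming you want to clip every non-blank line that is not a '>' header, whatever character it starts with. The helper name clip_line is made up for this example, and I haven't run it against your real data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (name invented for this example): return the line
# with its last four characters removed, unless it is blank or a header.
sub clip_line {
    my ($line) = @_;

    # Leave '>' header lines and whitespace-only lines untouched.
    return $line if $line =~ /^>/ || $line !~ /\S/;

    # Clip only when the line ends in a newline and is long enough to
    # hold four characters plus that newline.
    substr( $line, -5 ) = "\n" if length($line) >= 5 && $line =~ /\n\z/;

    return $line;
}

# Process the files named on the command line (or STDIN), line by line.
print clip_line($_) while <>;
```

The only real change from the script above is the test that decides which lines to clip; the substr trick is the same.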
Last edited by Telemachos; 02-16-2009 at 05:17 PM.
Thanks a lot. It's working fine. I'll improve it to save the output to a file and use it on a pretty large file. I'm a beginner in bioinformatics and pretty naive about things that are more info than bio.
Can you post the file you're working on - or at least a chunk of it? If it's skipping lines, that probably has to do with the regular expression. As I mentioned above, my version will only work on data that looks relevantly like the data you posted originally. That had three types of lines: 1) blanks, 2) lines starting with > and 3) lines of letters from the alphabet. The regex I wrote should catch all (and only) type 3. (Strictly speaking, the regex is testing for a line that begins with an alphabetic character.) At least, in the sample tests I ran here that's what happened. (See my sample above. It caught all four of the type 3 lines, not just the last one.)
On the other hand, I'm not sure what would make it chop off only two letters. Edit: here's one thought. If you're on Windows or the file is coming from a word processor, I suppose that the line endings may contain other non-printing characters which are messing things up. Again, seeing the data and knowing more about the files would help.
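One way to check for that yourself is a small script that makes the hidden characters visible. This is only a sketch: the helper name show_hidden is invented here, and it assumes the file is plain single-byte text.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (name invented for this example): rewrite
# non-printing characters as visible escapes so they show up on screen.
sub show_hidden {
    my ($line) = @_;

    $line =~ s/\r/\\r/g;    # carriage return (DOS/Windows line endings)
    $line =~ s/\n/\\n/g;    # newline
    $line =~ s/\t/\\t/g;    # tab

    # Anything else outside printable ASCII becomes a hex escape.
    $line =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord $1)/ge;

    return $line;
}

# Print each input line with its hidden characters spelled out.
print show_hidden($_), "\n" while <>;
```

Run it the same way as the first script. If sequence lines come out ending in \r\n rather than just \n, the file has DOS line endings.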
Last edited by Telemachos; 02-16-2009 at 10:36 AM.
Reason: Added a clarification
I ran the script on both Ubuntu Linux and Cygwin. The input is a file created in the Artemis software, and I believe it is in raw format with a .dna extension. I can open it in the Notepad++ editor, which I think creates proper text files that usually work fine. The longest file created looks like this:
>contig3113.0010 - 21 bases upstream
ctttccaccacacacac
>contig3113.0020 - 21 bases upstream
tgaaaaaagtacaacaa
>contig3113.0030 - 21 bases upstream
tataagtacattggcac
>contig00011.0010 - 21 bases upstream
aagctgttttcacggag
>contig00011.0020 - 21 bases upstream
tgaaagccgaggactct
>contig00011.0025 - 21 bases upstream
acttgcatgagggggaa
>contig00011.0030 - 21 bases upstream
gaaacggataaaacagg
>contig00011.0040 - 21 bases upstream
ggaacagactaggagca
But with 2000 sequences.
I should never have stopped reading "Learning Perl" and skipped to BioPerl...
Thanks
Well, I'm not sure what's going on at your end, but I can't reproduce your problem. When I run the script I suggested here (I'm on Debian Linux), the last four characters get clipped from all lines that don't begin with '>'. I can only guess that your file has some non-printing characters that are messing things up. (I ran an input file through unix2dos, and sure enough the dos version has one more character left than the unix one. But only one more character, which doesn't sync with what you described.)
Sorry, but I'm out of ideas. You might ask around your lab to see if anyone knows about the formatting of files produced by that Artemis program.
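If stray line-ending characters do turn out to be the culprit, something like the following might be more forgiving. It's a sketch under that assumption only: it sets aside whatever whitespace trails the line (LF, CRLF, spaces), clips the sequence, then puts the trailing part back. The helper name clip_sequence is made up for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (name invented for this example): drop the last $n
# characters of a sequence line, whatever line ending it carries.
sub clip_sequence {
    my ( $line, $n ) = @_;

    # Same test as the original script: only lines starting with a letter.
    return $line if $line !~ /^[[:alpha:]]/;

    # Separate the sequence from any trailing whitespace, which includes
    # "\n", "\r\n", and stray spaces or tabs.
    my ( $body, $tail ) = $line =~ /\A(.*?)(\s*)\z/s;

    # Remove the last $n characters, but only if the sequence is that long.
    substr( $body, -$n ) = '' if $n > 0 && length($body) >= $n;

    return $body . $tail;
}

# Clip four characters from every sequence line of the input.
print clip_sequence( $_, 4 ) while <>;
```

Unlike the first script, this doesn't count the newline as one of the characters to remove, so an extra carriage return at the end of the line shouldn't change how many letters get clipped.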