perl extract sequences
2
1
Entering edit mode
6.2 years ago
cabraham03 ▴ 20

Hi, I have a code to extract sequences and at the same time eliminate all the gaps (-), space, tabs, returns and space among each line of a sequences, some like :

from this :

>ID-Name
CCGCG  CTG--GATGCGGAC
ACCGA AGCAA-CCGCCAATA

to this :

>ID-Name
CCGCGCTGGATGCGGACACCGAAGCAACCGCCAATA

I have this code :

#!/usr/bin/perl

use strict;

my $input_file =$ARGV[0];
my $output_file =$ARGV[1];
if ($#ARGV !=1) { print "\n ** Wrong Arguments **\n\n"; print " - USE: fasta_remove.pl InFile.fasta OutFile.fasta\n"; } my$infile = $input_file; open INFILE,$infile or die  "Can't open $infile:$!\n";
my $outfile =$output_file;
open OUTFILE, ">$outfile" or die " - An output_file.fasta is Requested \n\n"; my$sequence = ();
my $line; my$idseq;
while ($line = <INFILE>) { chomp$line;
if($line =~ /^\s*$/) {
next;
}
elsif($line =~ /^\s*#/) { next; } elsif($line =~ tr/-//){
next;
}
elsif($line =~ /^>/) {$idseq= $line; print OUTFILE "\n$idseq\n";    # I know that the problem is here with "\n \n", but I don't know how to fix it !!!
next;
}
else {
$sequence =$line;
}

$sequence =~ s/\s//g; print OUTFILE "$sequence";
}

the problem is that alway make a line (whitespace) between the top of the file and the first sequence; I want to avoid that line,

if somebody can help me with that I will thank so much

sequence perl • 1.9k views
0
Entering edit mode

thanks so much !!

0
Entering edit mode

0
Entering edit mode

And your goal can be reached with a simple change:

else {
$sequence =$line;
}

to

else {
$sequence .=$line;
}

And please switch to BioPerl so these are handled better.

0
Entering edit mode

thanks so mucho, but it just don't work !!!

0
Entering edit mode

or you could use a simple one-liner instead if you are really eager to use perl:
 perl -ne ' if (/>/){ ($a > 0)?(print "\n$_"):(print $_);$a++; next}; chomp; s/ //g; s/-//g; print $_' test.txt > out  test.txt is your input file and out is your output ADD REPLY 3 Entering edit mode 6.2 years ago mxs ▴ 530 So this is a educational portal right? Please do not take this the wrong way, because I am personally advocate of "if it works don't fix it" principle, but from the above code it looks like you are just starting with Perl, so allow me to make a few suggestions: 1. use strict; # excellent practice. Even better would be to also use warnings. 2. my$input_file = $ARGV[0]; my$output_file = $ARGV[1]; No need for that you are just wasting memory. Though this may not be relevant for this particular case since only few bytes are lost better practice is to either directly use ARGV's or use Getopt module. So: open INFILE,$ARGV[0] or die  "Can't open $infile:$!\n";

my $input_file =$ARGV[0];
my $infile =$input_file;
open INFILE, $infile or die "Can't open$infile: $!\n"; Or if you use Getopt module then : use Getopt::Long; my ($infile,$outfile); GetOptions ('i=s' => \$infile, 'o=s' => \$outfile);  This way you don't need to have your ins and outs defined in a specific order, plus you have "option flags" and it is less likely to mix-up ins and outs 3. if ($#ARGV !=1) {
print "\n          ** Wrong Arguments **\n\n";
print "   - USE: fasta_remove.pl InFile.fasta OutFile.fasta\n";
}

Very bad practice, plus I think you should not allow a user to continue executing code if the condition is not satisfied. so there should  either be a die or exit function call after the last print. There are may ways how this could be done safely but I'll suggest one of the simplest ones in accordance to your code

if(!$infile or !$outfile){
print "\n          ** Wrong Arguments **\n\n";
print "Usage: perl program [options]\n";
print "\t-i\tinput file [fasta]\n";
print "\t-o\toutput file [fasta]\n";
exit(1);
}
"or die" in this type of situations is usually a safety measure reserved for verifying if the file is where it is supposed to be and if the write/read/execute permissions are allowing the action to take place.

4.

You have a lot of if-else statements. Totally not perl-ish. If this was a c code I would understand (personally I am a c/c++ programer) but this ... no. Moreover, do you really  need all those conditions?  Do they really need to be mutually exclusive? To me the logic of perl is like a logic of thinking and speaking, therefore if you phrase you conditions as you speak (English as a frame of reference ) you are probably off to a good start. Below you have my version of the parser:

#!/usr/bin/perl
use warnings;
use strict;
use Getopt::Long;

my ($infile,$outfile);
GetOptions ('i=s' => \$infile, 'o=s' => \$outfile);

if(!$infile or !$outfile){
print "\n          ** Wrong Arguments **\n\n";
print "Usage: perl program [options]\n";
print "\t-i\tinput file [fasta]\n";
print "\t-o\toutput file [fasta]\n";
exit(1);
}

open(IN, "<", $infile) or die "$!";
open(OUT, ">", $outfile) or die "$!";
my $lock = 0; while(<IN>){ chomp; if(/>/){ ($lock == 0) ? (print OUT "$_\n") : (print OUT "\n$_\n");
$lock = 1; next; } s/[\s|-]//g; print OUT "$_";
}
close IN;
close OUT;

It is important to close the filehandle after you are done unless you intend to loose some of the data. Perl flushes upon the exit but if you do not exit the program directly after you are done writing and intend to use the written data in another procedure, you will get into problems.

2
Entering edit mode

You make some good points, but I don't really agree with the second point because unless you are taking hundreds of files you would never notice the difference. It is much better to favour readability, and I agree with the Getopt solution for that reason,  it also makes the usage of the program much easier (avoids having a bunch of positional arguments).

There are a couple of other things that really stand out to me. Mainly, you should never use bare file handles, especially without specifying the mode. You can erase your data that way without knowing it, it's just too dangerous.

1
Entering edit mode

Thank you for the suggestions - I'm sure they'd help OP.

Also, I'd recommend switching to BioPerl - it is meant to be used to deal with biological formats.

2
Entering edit mode

"Also, I'd recommend switching to BioPerl - it is meant to be used to deal with biological formats."
That goes without saying.  If there is a tool out there use it, do not re-invent weals. My aim here was only to introduce some "good practice" strategies into the OP's code. That's how I learned. I first created a horrific piece of code gave it to a more experienced person to rewrite and then spent, sometimes days to study changes on that particular piece of code that I initially understood and could just focus on the "good practice" solutions introduced by the other guy.

0
Entering edit mode

thanks all of you very much, I really appreciate all yours comments, the true is that I just start to learn perl by myself, without previous knowledge of programming,  that is why my writing is all disorganized !!!! I will tray to do it better, it jus some times, I just don't find the way !!!

Well I think this will be a long way, but nobody said that it will be easy....... I like It !!!

2
Entering edit mode

It's OK that the code is not great, as long as you're willing to learn and better your code each step of the way, Perl is quite a challenging language to learn on your won, especially if you do not have programming fundamentals. If you'd like an easier alternative, go for Python.

2
Entering edit mode
6.2 years ago
Ram 32k

Change

elsif($line =~ /^>/) {$idseq= $line; print OUTFILE "\n$idseq\n";    # I know that the problem is here with "\n \n", but I don't know how to fix it !!!
next;
}
else {
$sequence =$line;
}

$sequence =~ s/\s//g; print OUTFILE "$sequence";
}

to

elsif($line =~ /^>/) {$idseq= $line; print OUTFILE "$idseq\n";    # I know that the problem is here with "\n \n", but I don't know how to fix it !!!
next;
}
else {
$sequence .=$line;
}

$sequence =~ s/\s//g; print OUTFILE "$sequence\n";
}

EDIT: Changed code to address OP's single-line FASTA requirement

0
Entering edit mode

it works, but with the

print OUTFILE "\$sequence\n";

it print the sequence like :

>ID-Name
CCGCGCTGGATGCGGAC
ACCGAAGCAACCGCCAAT

and I want it in a single line like

>ID-Name
CCGCGCTGGATGCGGACACCGAAGCAACCGCCAATA 
1
Entering edit mode

You mention in a comment that the code doesn't work. What do you mean by that, do you still see multiple lines?

0
Entering edit mode

when I run it with multiples fasta, each sequence are concatenated to the next!!!

But thanks so much for all your help, I appreciate it !!! thanks so much !!

0
Entering edit mode

That needs an additional loop to be added to the script. Please switch to BioPerl - it gets a LOT easier with BioPerl/BioPython as the complexity of the problem increases.