Question: perl extract sequences
1
gravatar for cabraham03
2.5 years ago by
cabraham0320
Mexico
cabraham0320 wrote:

Hi, I have a code to extract sequences and at the same time eliminate all the gaps (-), space, tabs, returns and space among each line of a sequences, some like : 

from this : 

>ID-Name
CCGCG  CTG--GATGCGGAC
ACCGA AGCAA-CCGCCAATA

to this : 

>ID-Name
CCGCGCTGGATGCGGACACCGAAGCAACCGCCAATA

I have this code : 

#!/usr/bin/perl

use strict;

my $input_file = $ARGV[0];
my $output_file = $ARGV[1];
if ($#ARGV !=1) {
    print "\n          ** Wrong Arguments **\n\n";
    print "   - USE: fasta_remove.pl InFile.fasta OutFile.fasta\n";
}
my $infile = $input_file;                              
open INFILE, $infile or die  "Can't open $infile: $!\n";       
my $outfile = $output_file;                             
open OUTFILE, ">$outfile" or die "   - An output_file.fasta is Requested \n\n";  
my $sequence = ();  
my $line;                           
my $idseq;
while ($line = <INFILE>) {
    chomp $line;                      
    if($line =~ /^\s*$/) {     
       next;
    }
    elsif($line =~ /^\s*#/) {        
        next; 
    }
    elsif($line =~ tr/-//){          
        next;
    }
    elsif($line =~ /^>/) {           
         $idseq= $line;
        print OUTFILE "\n $idseq\n";    # I know that the problem is here with "\n \n", but I don't know how to fix it !!!
         next;
    }
    else {
        $sequence = $line;
    }

    $sequence =~ s/\s//g;              
    print OUTFILE "$sequence";
}

the problem is that alway make a line (whitespace) between the top of the file and the first sequence; I want to avoid that line,

 

if somebody can help me with that I will thank so much 

 

sequence perl • 922 views
ADD COMMENTlink modified 2.5 years ago by mxs510 • written 2.5 years ago by cabraham0320

 

thanks so much !! 

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by cabraham0320

Can you move this to a comment on my answer please? Copy the contents, Click on "Add Comment" on my answer, paste the contents and hit "Add Comment".

ADD REPLYlink written 2.5 years ago by Ram12k

And your goal can be reached with a simple change:

else {
        $sequence = $line;
    }

to

else {
        $sequence .= $line;
    }

And please switch to BioPerl so these are handled better.

 

ADD REPLYlink written 2.5 years ago by Ram12k

thanks so mucho, but it just don't work !!!

ADD REPLYlink written 2.5 years ago by cabraham0320

or you could use a simple one-liner instead if you are really eager to use perl:
perl -ne ' if (/>/){ ($a > 0)?(print "\n$_"):(print $_);$a++; next}; chomp; s/ //g; s/-//g; print $_' test.txt > out
 

test.txt is your input file and out is your output

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by mxs510
3
gravatar for mxs
2.5 years ago by
mxs510
mxs510 wrote:

So this is a educational portal right? Please do not take this the wrong way, because I am personally  advocate of "if it works don't fix it" principle, but from the above code it looks like you are just starting with Perl, so allow me to make a few suggestions:

 

1.

use strict;  # excellent practice. Even better would be to also use warnings.

2.

my $input_file = $ARGV[0];
my $output_file = $ARGV[1];

No need for that you are just wasting memory. Though this may not be relevant for this particular case since only few bytes are lost better practice is to either directly use ARGV's or use Getopt module. So:

open INFILE, $ARGV[0] or die  "Can't open $infile: $!\n"; 

instead of

my $input_file = $ARGV[0];
my $infile = $input_file;                              
open INFILE, $infile or die  "Can't open $infile: $!\n";

Or if you use Getopt module then :

use Getopt::Long;
my ($infile,$outfile);
GetOptions ('i=s' => \$infile, 'o=s' => \$outfile);

This way you don't need to have your ins and outs defined in a specific order, plus you have  "option flags"  and it is less likely to mix-up ins and outs

3.

if ($#ARGV !=1) {
    print "\n          ** Wrong Arguments **\n\n";
    print "   - USE: fasta_remove.pl InFile.fasta OutFile.fasta\n";
}

Very bad practice, plus I think you should not allow a user to continue executing code if the condition is not satisfied. so there should  either be a die or exit function call after the last print. There are may ways how this could be done safely but I'll suggest one of the simplest ones in accordance to your code

if(!$infile or !$outfile){
  print "\n          ** Wrong Arguments **\n\n";
  print "Usage: perl program [options]\n";
  print "\t-i\tinput file [fasta]\n";
  print "\t-o\toutput file [fasta]\n";
  exit(1);
}
"or die" in this type of situations is usually a safety measure reserved for verifying if the file is where it is supposed to be and if the write/read/execute permissions are allowing the action to take place. 

4.

You have a lot of if-else statements. Totally not perl-ish. If this was a c code I would understand (personally I am a c/c++ programer) but this ... no. Moreover, do you really  need all those conditions?  Do they really need to be mutually exclusive? To me the logic of perl is like a logic of thinking and speaking, therefore if you phrase you conditions as you speak (English as a frame of reference ) you are probably off to a good start. Below you have my version of the parser:

#!/usr/bin/perl
use warnings;
use strict;
use Getopt::Long;

my ($infile,$outfile);
GetOptions ('i=s' => \$infile, 'o=s' => \$outfile);

if(!$infile or !$outfile){
  print "\n          ** Wrong Arguments **\n\n";
  print "Usage: perl program [options]\n";
  print "\t-i\tinput file [fasta]\n";
  print "\t-o\toutput file [fasta]\n";
  exit(1);
}

open(IN, "<", $infile) or die "$!";
open(OUT, ">", $outfile) or die "$!";
my $lock = 0;
while(<IN>){
  chomp;
  if(/>/){
    ($lock == 0) ? (print OUT "$_\n") : (print OUT "\n$_\n");
    $lock = 1;
    next;
  }
  s/[\s|-]//g;
  print OUT "$_";
}
close IN;
close OUT;

It is important to close the filehandle after you are done unless you intend to loose some of the data. Perl flushes upon the exit but if you do not exit the program directly after you are done writing and intend to use the written data in another procedure, you will get into problems.

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by mxs510
2

You make some good points, but I don't really agree with the second point because unless you are taking hundreds of files you would never notice the difference. It is much better to favour readability, and I agree with the Getopt solution for that reason,  it also makes the usage of the program much easier (avoids having a bunch of positional arguments).

There are a couple of other things that really stand out to me. Mainly, you should never use bare file handles, especially without specifying the mode. You can erase your data that way without knowing it, it's just too dangerous.

ADD REPLYlink written 2.5 years ago by SES7.9k
1

Thank you for the suggestions - I'm sure they'd help OP.

Also, I'd recommend switching to BioPerl - it is meant to be used to deal with biological formats.

ADD REPLYlink written 2.5 years ago by Ram12k
2

"Also, I'd recommend switching to BioPerl - it is meant to be used to deal with biological formats."
That goes without saying.  If there is a tool out there use it, do not re-invent weals. My aim here was only to introduce some "good practice" strategies into the OP's code. That's how I learned. I first created a horrific piece of code gave it to a more experienced person to rewrite and then spent, sometimes days to study changes on that particular piece of code that I initially understood and could just focus on the "good practice" solutions introduced by the other guy. 

ADD REPLYlink written 2.5 years ago by mxs510

thanks all of you very much, I really appreciate all yours comments, the true is that I just start to learn perl by myself, without previous knowledge of programming,  that is why my writing is all disorganized !!!! I will tray to do it better, it jus some times, I just don't find the way !!!

Well I think this will be a long way, but nobody said that it will be easy....... I like It !!! 

ADD REPLYlink written 2.5 years ago by cabraham0320
2

It's OK that the code is not great, as long as you're willing to learn and better your code each step of the way, Perl is quite a challenging language to learn on your won, especially if you do not have programming fundamentals. If you'd like an easier alternative, go for Python.

ADD REPLYlink written 2.5 years ago by Ram12k
2
gravatar for Ram
2.5 years ago by
Ram12k
New York
Ram12k wrote:

Change

elsif($line =~ /^>/) {           
         $idseq= $line;
        print OUTFILE "\n $idseq\n";    # I know that the problem is here with "\n \n", but I don't know how to fix it !!!
         next;
    }
    else {
        $sequence = $line;
    }

    $sequence =~ s/\s//g;              
    print OUTFILE "$sequence";
}

to

elsif($line =~ /^>/) {           
         $idseq= $line;
        print OUTFILE "$idseq\n";    # I know that the problem is here with "\n \n", but I don't know how to fix it !!!
         next;
    }
    else {
        $sequence .= $line;
    }

    $sequence =~ s/\s//g;              
    print OUTFILE "$sequence\n";
}

EDIT: Changed code to address OP's single-line FASTA requirement

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Ram12k

it works, but with the 

print OUTFILE "$sequence\n"; 

it print the sequence like : 

>ID-Name
CCGCGCTGGATGCGGAC
ACCGAAGCAACCGCCAAT

and I want it in a single line like 

>ID-Name
CCGCGCTGGATGCGGACACCGAAGCAACCGCCAATA 
ADD REPLYlink written 2.5 years ago by cabraham0320
1

You mention in a comment that the code doesn't work. What do you mean by that, do you still see multiple lines?

ADD REPLYlink written 2.5 years ago by Ram12k

when I run it with multiples fasta, each sequence are concatenated to the next!!!

But thanks so much for all your help, I appreciate it !!! thanks so much !! 

ADD REPLYlink written 2.5 years ago by cabraham0320

That needs an additional loop to be added to the script. Please switch to BioPerl - it gets a LOT easier with BioPerl/BioPython as the complexity of the problem increases.

ADD REPLYlink written 2.5 years ago by Ram12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 761 users visited in the last hour