Question

perl extract sequences

10.4 years ago

cabraham03 ▴ 30

Hi, I have a script that extracts sequences and at the same time removes all gaps (-), spaces, tabs, and line breaks within each sequence, for example:

From this:

>ID-Name
CCGCG  CTG--GATGCGGAC
ACCGA AGCAA-CCGCCAATA

to this:

>ID-Name
CCGCGCTGGATGCGGACACCGAAGCAACCGCCAATA

I have this code:

#!/usr/bin/perl

use strict;

my $input_file = $ARGV[0];
my $output_file = $ARGV[1];
if ($#ARGV !=1) {
    print "\n          ** Wrong Arguments **\n\n";
    print "   - USE: fasta_remove.pl InFile.fasta OutFile.fasta\n";
}
my $infile = $input_file;                              
open INFILE, $infile or die  "Can't open $infile: $!\n";       
my $outfile = $output_file;                             
open OUTFILE, ">$outfile" or die "   - An output_file.fasta is Requested \n\n";  
my $sequence = ();  
my $line;                           
my $idseq;
while ($line = <INFILE>) {
    chomp $line;                      
    if($line =~ /^\s*$/) {     
       next;
    }
    elsif($line =~ /^\s*#/) {        
        next; 
    }
    elsif($line =~ tr/-//){          
        next;
    }
    elsif($line =~ /^>/) {           
         $idseq= $line;
        print OUTFILE "\n $idseq\n";    # I know that the problem is here with "\n \n", but I don't know how to fix it !!!
         next;
    }
    else {
        $sequence = $line;
    }

    $sequence =~ s/\s//g;              
    print OUTFILE "$sequence";
}

The problem is that it always writes a blank line at the top of the file, before the first sequence. I want to avoid that line; if somebody can help me with that I would be very grateful.

perl sequence • 3.7k views

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by cabraham03 ▴ 30

Or you could use a simple one-liner instead if you are really eager to use perl:

perl -ne 'if (/^>/) { print "\n" if $n++; print; next } chomp; s/[\s-]//g; print; END { print "\n" }' test.txt > out

test.txt is your input file and out is your output.

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by mxs ▴ 530

Ram · Answer 1 · 2015-02-03

So this is an educational portal, right? Please do not take this the wrong way, because I am personally an advocate of the "if it works, don't fix it" principle, but from the above code it looks like you are just starting with Perl, so allow me to make a few suggestions:

1. use strict; # excellent practice. Even better would be to also use warnings

2.

my $input_file = $ARGV[0];
my $output_file = $ARGV[1];

There is no need for that; the extra copies are redundant. The cost is only a few bytes, so it hardly matters in this particular case, but it is better practice either to use the @ARGV elements directly or to use the Getopt::Long module. So:

open INFILE, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!\n";

instead of

my $input_file = $ARGV[0];
my $infile = $input_file;                              
open INFILE, $infile or die  "Can't open $infile: $!\n";

Or, if you use the Getopt::Long module:

use Getopt::Long;
my ($infile,$outfile);
GetOptions ('i=s' => \$infile, 'o=s' => \$outfile);

This way you don't need to pass the input and output in a specific order; in addition, you get "option flags", so it is less likely that the two get mixed up.
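A minimal sketch of that order independence, assuming Getopt::Long's GetOptionsFromArray (which parses a given array instead of @ARGV). The helper name parse_io and the file names are hypothetical and only serve the illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# parse_io is a hypothetical helper: it parses -i/-o from a list of
# arguments and returns (input, output), whatever order they came in.
sub parse_io {
    my @args = @_;
    my ($infile, $outfile);
    GetOptionsFromArray(\@args, 'i=s' => \$infile, 'o=s' => \$outfile)
        or die "Bad options\n";
    return ($infile, $outfile);
}

# Both calls yield the same pair, although the flags are swapped.
my @first  = parse_io('-i', 'in.fasta', '-o', 'out.fasta');
my @second = parse_io('-o', 'out.fasta', '-i', 'in.fasta');
print "@first\n";
print "@second\n";
```

In a real script you would call GetOptions directly, as above; the array variant is used here only so that both orderings can be shown in one run.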

3.

if ($#ARGV !=1) {
    print "\n          ** Wrong Arguments **\n\n";
    print "   - USE: fasta_remove.pl InFile.fasta OutFile.fasta\n";
}

Very bad practice; in addition, I think you should not allow the user to continue executing code if the condition is not satisfied, so there should be either a die or an exit call after the last print. There are many ways to do this safely, but I'll suggest one of the simplest, in keeping with your code:

if(!$infile or !$outfile){
  print "\n          ** Wrong Arguments **\n\n";
  print "Usage: perl program [options]\n";
  print "\t-i\tinput file [fasta]\n";
  print "\t-o\toutput file [fasta]\n";
  exit(1);
}

An or die in this type of situation is usually a safety measure reserved for checking that the file is where it is supposed to be and that the read/write/execute permissions allow the action to take place.
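As a rough sketch of that use of or die, the following reports the system error ($!) when a file cannot be opened. The helper name open_for_reading and the file name are hypothetical, and the example assumes no file called no_such_file.fasta exists in the current directory:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# open_for_reading is a hypothetical helper: it dies with the system
# error ($!) when the file is missing or not readable.
sub open_for_reading {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!\n";
    return $fh;
}

# eval catches the die so the message can be inspected here;
# a real script would normally just let it terminate the program.
my $fh = eval { open_for_reading('no_such_file.fasta') };
print "Caught: $@" unless $fh;
```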

4. You have a lot of if-else statements, which is not very Perl-ish. If this were C code I would understand (personally I am a C/C++ programmer), but here, no. Moreover, do you really need all those conditions? Do they really need to be mutually exclusive? To me the logic of Perl is like the logic of thinking and speaking, so if you phrase your conditions as you would speak them (with English as a frame of reference) you are probably off to a good start. Below is my version of the parser:

#!/usr/bin/perl
use warnings;
use strict;
use Getopt::Long;

my ($infile,$outfile);
GetOptions ('i=s' => \$infile, 'o=s' => \$outfile);

if(!$infile or !$outfile){
  print "\n          ** Wrong Arguments **\n\n";
  print "Usage: perl program [options]\n";
  print "\t-i\tinput file [fasta]\n";
  print "\t-o\toutput file [fasta]\n";
  exit(1);
}

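open(IN, "<", $infile) or die "Can't open $infile: $!";
open(OUT, ">", $outfile) or die "Can't write $outfile: $!";
my $lock = 0;    # becomes 1 once the first header has been written
while(<IN>){
  chomp;
  if(/^>/){
    # every header after the first gets a newline to end the previous sequence
    ($lock == 0) ? (print OUT "$_\n") : (print OUT "\n$_\n");
    $lock = 1;
    next;
  }
  s/[\s-]//g;    # remove whitespace and gap characters
  print OUT $_;
}
print OUT "\n" if $lock;    # terminate the last sequence line
close IN;
close OUT;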

It is important to close the filehandle when you are done, unless you are willing to lose some of the data. Perl flushes on exit, but if you do not exit the program directly after you finish writing and intend to use the written data in another procedure, you will run into problems.
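A small sketch of why the close matters when the same program reads the data back. It assumes the core File::Temp module and uses a temporary file as a stand-in for the real output file:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write one record to a temporary file (a stand-in for the real output).
my ($out, $path) = tempfile(UNLINK => 1);
print $out ">ID-Name\nCCGCGCTGGATGCGGAC\n";

# Closing flushes Perl's buffer to disk; checking the return value
# also catches write errors such as a full disk.
close($out) or die "Can't close $path: $!\n";

# Only now is it safe to read the data back in the same program.
open(my $in, '<', $path) or die "Can't open $path: $!\n";
my @lines = <$in>;
close($in);
print scalar(@lines), " lines read back\n";
```

Without the close, the two lines could still be sitting in the output buffer when the file is reopened, and the read could come back empty.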

Ram · Answer 2 · 2015-02-02

10.4 years ago

Ram 45k

Change

elsif($line =~ /^>/) {           
         $idseq= $line;
        print OUTFILE "\n $idseq\n";    # I know that the problem is here with "\n \n", but I don't know how to fix it !!!
         next;
    }
    else {
        $sequence = $line;
    }

    $sequence =~ s/\s//g;              
    print OUTFILE "$sequence";
}

to

elsif($line =~ /^>/) {           
         $idseq= $line;
        print OUTFILE "$idseq\n";    # the leading "\n " is gone, so no blank line is written at the top of the file
         next;
    }
    else {
        $sequence .= $line;
    }

    $sequence =~ s/\s//g;              
    print OUTFILE "$sequence\n";
}

EDIT: Changed code to address OP's single-line FASTA requirement

ADD COMMENT • link 3.2 years ago by Ram 45k

Thanks so much!

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by cabraham03 ▴ 30

Can you move this to a comment on my answer please? Copy the contents, Click on "Add Comment" on my answer, paste the contents and hit "Add Comment".

ADD REPLY • link 10.4 years ago by Ram 45k

And your goal can be reached with a simple change:

else {
        $sequence = $line;
    }

to

else {
        $sequence .= $line;
    }

And please switch to BioPerl so these are handled better.

ADD REPLY • link 3.2 years ago by Ram 45k

Thanks so much, but it just doesn't work!

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by cabraham03 ▴ 30

It works, but with the

print OUTFILE "$sequence\n";

it prints the sequence like:

>ID-Name
CCGCGCTGGATGCGGAC
ACCGAAGCAACCGCCAATA

and I want it in a single line like

>ID-Name
CCGCGCTGGATGCGGACACCGAAGCAACCGCCAATA

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by cabraham03 ▴ 30

You mention in a comment that the code doesn't work. What do you mean by that, do you still see multiple lines?

ADD REPLY • link 10.4 years ago by Ram 45k

When I run it on a multi-FASTA file, each sequence is concatenated to the next!

But thanks so much for all your help, I appreciate it !!! thanks so much!!

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by cabraham03 ▴ 30

That needs an additional loop to be added to the script. Please switch to BioPerl - it gets a LOT easier with BioPerl/BioPython as the complexity of the problem increases.

ADD REPLY • link 3.2 years ago by Ram 45k
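For readers who want to stay with plain Perl, the multi-record case discussed above can be sketched by buffering each sequence and flushing it when the next header (or the end of the input) is reached. This is only one possible approach, not the code from the answers; the helper name unwrap_fasta and the second record (ID-2) are made up for the illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# unwrap_fasta is a hypothetical helper: it takes FASTA text and returns it
# with every sequence on a single line, gaps and whitespace removed.
# Sequence lines that appear before the first header are dropped.
sub unwrap_fasta {
    my ($text) = @_;
    my ($out, $seq, $seen_header) = ('', '', 0);
    for my $line (split /\n/, $text) {
        $line =~ s/\r$//;                       # tolerate Windows line endings
        next if $line =~ /^\s*(#.*)?$/;         # skip blank and comment lines
        if ($line =~ /^>/) {
            $out .= "$seq\n" if $seen_header;   # flush the previous record
            $out .= "$line\n";
            ($seq, $seen_header) = ('', 1);
            next;
        }
        $line =~ s/[\s-]//g;                    # strip whitespace and gaps
        $seq .= $line;
    }
    $out .= "$seq\n" if $seen_header;           # flush the last record
    return $out;
}

print unwrap_fasta(
    ">ID-Name\nCCGCG  CTG--GATGCGGAC\nACCGA AGCAA-CCGCCAATA\n" .
    ">ID-2\nAC-GT\nTT AA\n"
);
```

Because the buffer is reset at every header, sequences from different records are never concatenated, and no blank line is written at the top of the output.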