Question: How To Extract Multiple Fasta Sequences From A File With Perl?
gravatar for loicatraile
6.5 years ago by
loicatraile40 wrote:

Hi.. I have a FASTA file with many DNA sequences. I need to read the FASTA file, delete the header and save the sequences in diferent variables. Any suggestions on how to do it in Perl (please not Bio Perl)?

Example for the FASTA file:





perl fasta bioinformatics • 6.0k views
ADD COMMENTlink modified 6.5 years ago by anin.gregory90 • written 6.5 years ago by loicatraile40

is this a homework question?

ADD REPLYlink written 6.5 years ago by Gjain5.4k

Why not bioperl? That would be the easiest way to do this problem

ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by anin.gregory90

Don't use bioperl when it is easy to write plain perl. Read my answer in How to efficiently parse a huge fastq file?. On fastq parsing, bioperl can be tens of times slower than plain perl.

ADD REPLYlink written 6.5 years ago by lh331k

Yes--you're quite right! It's best to use Bio::SeqIO or the like, in this situation, as it makes little sense to reinvent the wheel. Although many do simply enjoy reinventing the wheel... For some reason... (?) However, in this case the OP excluded BioPerl, so the wheel was reinvented. One nice result, though, is that the OP may only have to wait around for only ~10 milliseconds for the 'raw' Perl code to execute instead of a lengthy ~200 milliseconds for BioPerl. Oh, my...

ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by Kenosis1.2k

Real-world data are frequently at the scale of GB in bioinfo. The difference between a fast C program and BioPerl is like 0.5 min vs. 1 hour or occasionally 1 day vs. half a year. When you learn to program in bioinfo, you have to learn the right solution for large-scale data at the beginning. In addition, you have to reinvent simple wheels to learn to solve complex real problems. If you cannot parse fasta with plain perl, you will not be able to parse a lot of customized data formats not supported by BioPerl. If this is a homework question, the teacher is right in all ways to disallow bioperl.

ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by lh331k

The OP's case is simply not an issue of "real-world data," so an appeal to (eventually) using C for this purpose doesn't work. The choice of a programming language and algorithms is best done case-by-case, and shouldn't be informed, a priori, by a single construct (e.g., "speed"). One can't cogently (or rationally) argue for using a race car because of its "speed" if the current need is for a vehicle for city driving. Think case-by-case.

You said, "If this is a homework question, the teacher is right in all ways to disallow bioperl." This categorical statement expresses an alarming fundamentalism which effectively thwarts efforts to thoughtfully choose the best tool for the job at hand. Cultivating this attitude of a priori exclusion would inextricably hobble a student's ability to find the best programmatic solution(s) for a particular job, so should be avoided--not merely for the student's benefit, but employers greatly appreciate cognitive openness/adaptiveness. Additionally, if this particular problem arose within a learning environment, a cognitivist's pedagogy would support an instructional design which moves from simpler, module-based programming solutions (e.g., BioPerl) to more complex, individually-written solutions as an effective means toward developing a programmer's cognitive schema and programming-language proficiencies within students.

Over 20 years ago, I wrote different program segments in different languages: some parts in a high-level language, others in assembly because--in a particular case for a particular routine--assembly was the best choice. Think case-by-case.

As a side note, having to process GBs of data has nothing inherently to do with choosing a specific programming language for that job. Perl may be just as efficient as C for processing GBs of data, depending on how the data needs to be processed, as the disk i/o may actually render both languages equally 'fast' at the job. Think case-by-case. Perseverating on a single construct (e.g., "speed") is just too constraining.

If you're so interested in the execution speed of code, why are you not evangelizing for using Fortran or Objective-Caml? Perhaps the following read may be interesting to you: The 'C is Efficient' Language Fallacy.

The best to you...

ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by Kenosis1.2k
gravatar for Kenosis
6.5 years ago by
Kenosis1.2k wrote:

The sequences may not be on a single line, so the script needs to accomodate for that. Give this, consider the following:

use strict;
use warnings;

my $seq;
while (<>) {
    if (/^>/) {
        print "$seq\n" if $seq;
        undef $seq;

    $seq .= $_;

    print "$seq\n" if $seq and eof ARGV;

Usage: perl inFile [>outFile]

The last, optional parameter directs output to a file.

Output on your dataset:


If a header line is detected, the current sequence is printed (if there is one), the sequence variable is undefined and the next line is read. For non-header lines, each is chomped and added to the sequence variable. Finally, the sequence variable is printed if the next file read would encounter the end of the file.

It's unclear to me what you mean by "save the sequences in diferent variables." However, you could push each sequence onto an array (after, for example, declaring my @seqArray): push @seqArray, $seq in both places where $seq is printed, and then process that array of sequences outside the while loop.

Hope this helps!

ADD COMMENTlink written 6.5 years ago by Kenosis1.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1305 users visited in the last hour