Question

How To Extract Multiple Fasta Sequences From A File With Perl?

1

Entering edit mode

11.0 years ago

loicatraile ▴ 50

Hi

I have a FASTA file with many DNA sequences. I need to read the FASTA file, delete the header and save the sequences in different variables. Any suggestions on how to do it in Perl (please not Bio Perl)?

Example for the FASTA file:

>cow
ATGCTGTAGCGGAGGCATTATCGGCTA
>chicken
TAGCGTATTTTAG

fasta perl • 7.9k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 11.0 years ago by loicatraile ▴ 50

4

Entering edit mode

is this a homework question?

ADD REPLY • link 11.0 years ago by Gjain 5.8k

2

Entering edit mode

Why not bioperl? That would be the easiest way to do this problem

ADD REPLY • link 11.0 years ago by anin.gregory ▴ 110

0

Entering edit mode

Don't use bioperl when it is easy to write plain perl. Read my answer in How to efficiently parse a huge fastq file?. On fastq parsing, bioperl can be tens of times slower than plain perl.

ADD REPLY • link 11.0 years ago by lh3 33k

0

Entering edit mode

Yes--you're quite right! It's best to use Bio::SeqIO or the like, in this situation, as it makes little sense to reinvent the wheel. Although many do simply enjoy reinventing the wheel... For some reason... (?) However, in this case the OP excluded BioPerl, so the wheel was reinvented. One nice result, though, is that the OP may only have to wait around for only ~10 milliseconds for the 'raw' Perl code to execute instead of a lengthy ~200 milliseconds for BioPerl. Oh, my...

ADD REPLY • link 11.0 years ago by Kenosis ★ 1.3k

1

Entering edit mode

Real-world data are frequently at the scale of GB in bioinfo. The difference between a fast C program and BioPerl is like 0.5 min vs. 1 hour or occasionally 1 day vs. half a year. When you learn to program in bioinfo, you have to learn the right solution for large-scale data at the beginning. In addition, you have to reinvent simple wheels to learn to solve complex real problems. If you cannot parse fasta with plain perl, you will not be able to parse a lot of customized data formats not supported by BioPerl. If this is a homework question, the teacher is right in all ways to disallow bioperl.

ADD REPLY • link 11.0 years ago by lh3 33k

2

Entering edit mode

The OP's case is simply not an issue of "real-world data," so an appeal to (eventually) using C for this purpose doesn't work. The choice of a programming language and algorithms is best done case-by-case, and shouldn't be informed, a priori, by a single construct (e.g., "speed"). One can't cogently (or rationally) argue for using a race car because of its "speed" if the current need is for a vehicle for city driving. Think case-by-case.

You said, "If this is a homework question, the teacher is right in all ways to disallow bioperl." This categorical statement expresses an alarming fundamentalism which effectively thwarts efforts to thoughtfully choose the best tool for the job at hand. Cultivating this attitude of a priori exclusion would inextricably hobble a student's ability to find the best programmatic solution(s) for a particular job, so should be avoided--not merely for the student's benefit, but employers greatly appreciate cognitive openness/adaptiveness. Additionally, if this particular problem arose within a learning environment, a cognitivist's pedagogy would support an instructional design which moves from simpler, module-based programming solutions (e.g., BioPerl) to more complex, individually-written solutions as an effective means toward developing a programmer's cognitive schema and programming-language proficiencies within students.

Over 20 years ago, I wrote different program segments in different languages: some parts in a high-level language, others in assembly because--in a particular case for a particular routine--assembly was the best choice. Think case-by-case.

As a side note, having to process GBs of data has nothing inherently to do with choosing a specific programming language for that job. Perl may be just as efficient as C for processing GBs of data, depending on how the data needs to be processed, as the disk i/o may actually render both languages equally 'fast' at the job. Think case-by-case. Perseverating on a single construct (e.g., "speed") is just too constraining.

If you're so interested in the execution speed of code, why are you not evangelizing for using Fortran or Objective-Caml? Perhaps the following read may be interesting to you: The 'C is Efficient' Language Fallacy.

The best to you...

ADD REPLY • link 11.0 years ago by Kenosis ★ 1.3k

score 3 · Answer 1 · 2013-05-03

The sequences may not be on a single line, so the script needs to accomodate for that. Give this, consider the following:

use strict;
use warnings;

my $seq;
while (<>) {
    if (/^>/) {
        print "$seq\n" if $seq;
        undef $seq;
        next;
    }

    chomp;
    $seq .= $_;

    print "$seq\n" if $seq and eof ARGV;
}

Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs output to a file.

Output on your dataset:

ATGCTGTAGCGGAGGCATTATCGGCTA
TAGCGTATTTTAG

If a header line is detected, the current sequence is printed (if there is one), the sequence variable is undefined and the next line is read. For non-header lines, each is chomped and added to the sequence variable. Finally, the sequence variable is printed if the next file read would encounter the end of the file.

It's unclear to me what you mean by "save the sequences in diferent variables." However, you could push each sequence onto an array (after, for example, declaring my @seqArray): push @seqArray, $seq in both places where $seq is printed, and then process that array of sequences outside the while loop.

Hope this helps!