Question

Extracting Multiple Fasta Sequences At A Time From A File Containing Many Sequences

11

Entering edit mode

14.8 years ago

Sarah ▴ 110

I know a similar question was asked previously, but the "vanilla" perl code (which works beautifully if you want to extract ONE sequence, doesn't extract what I need (about 100 fasta sequences by ID from a big file containing 3000 total FASTA files concatenated in one file). Can anyone help? Perhaps it is a simple modification? Here is the code that was posted, I just don't know how to change it so I can input more than one ID (t currently fails if you put more than one). Many thanks to you knowledgeable folks out there!

(usage: extractSeqByID.pl SEQ123 < huge.fsa > my.fsa)

use warnings;
use strict;

my $lookup = shift @ARGV;  # ID to extract

local $/ = "\n>";  # read by FASTA record

while (my $seq = <>) {
    chomp $seq;
    my ($id) = $seq =~ /^>*(\S+)/;  # parse ID as first word in FASTA header
    if ($id eq $lookup) {
        $seq =~ s/^>*.+\n//;  # remove FASTA header
        $seq =~ s/\n//g;  # remove endlines
        print "$seq\n";
        last;
    }
}

fasta sequence parsing perl • 58k views

ADD COMMENT • link updated 14.8 years ago by Peter 6.0k • written 14.8 years ago by Sarah ▴ 110

0

Entering edit mode

Hi everyone.

i got a fasta file of 10 sequences. now i wanna write a python code that reads each sequence and parse it to RNAfold.exe for a secondary structure prediction. then, for each structure, print out some features like for example, number of G:U wobble pairs,CpG occurence and G+C content. can anybody help me out?i need your help becoz i am damn soo "good" at programming. thnaks a lot

ADD REPLY • link 13.3 years ago by User 2925 ▴ 10

0

Entering edit mode

If you have really MANY sequences (thousands or millions) it's worth to put them into database. I recommend SQLite (http://www.sqlite.org/) - query is extremely rapid, db is stored in just one file, and there are bindings for many languages (very good for Python for instance).

ADD REPLY • link 13.1 years ago by Leszek 4.2k

Ram · Answer 1 · 2010-10-08

17

Entering edit mode

14.8 years ago

Eric Normandeau 11k

Hi Sarah,

I am learning Perl myself and cannot directly answer your question (although I'll be happy to learn the solution too). However, here is some code in Python, requiring Biopython, that will do the trick:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]  # Input fasta file
number_file = sys.argv[2] # Input interesting numbers file, one per line
result_file = sys.argv[3] # Output fasta file

wanted = set()
with open(number_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)
fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
end = False
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")

Save this to a file, make it executable, and call it in this way:

./script.py input.fasta name_list.txt output.fasta

The name_list.txt file should contain one ID per line. The ID should match exactly to the sequence name, although this can be modified easily. The input.fasta file can be of any size.

Cheers!

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 14.8 years ago by Eric Normandeau 11k

1

Entering edit mode

+1 for Biopython :) in addition, you can load ids from file using just ids= set( x.strip() for x in open(idfile) ) and you don't need to import re

ADD REPLY • link 14.0 years ago by Leszek 4.2k

1

Entering edit mode

Hi Eric. You are forgetting to close your file handles. Also, making multiple calls to SeqIO.write() is not very efficient, and won't work on more complex file formats. A single call is recommend, using an iterator approach, as in the "Filtering a sequence file" example in the Tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html which I will add as an answer separately below.

ADD REPLY • link 13.1 years ago by Peter 6.0k

0

Entering edit mode

Hi Leszek. I removed the 'import re' bit since it was useless anyway, just a carry over from a previous code. Cheers!

ADD REPLY • link 14.0 years ago by Eric Normandeau 11k

0

Entering edit mode

Hi Eric,is it possible to match the sequence ids in the wanted file with first 15 letters of each Id in input fasta file. I mean just matching a part of ids not whole. Thanks in advance,

ADD REPLY • link 13.7 years ago by Simran ▴ 50

0

Entering edit mode

Hi @Simran. It sure is possible. Replace 'wanted.add(line)' by 'wanted.add(line[0:15])' and also replace 'if seq.id in wanted:' by 'if seq.id[0:15] in wanted:' and it should do the trick. Cheers

ADD REPLY • link 13.7 years ago by Eric Normandeau 11k

0

Entering edit mode

Thanks for posting Biopython code - just learning now. I tried this but running the .py file says I dont have permission (w/ the ./) before it, and when I type in the code it gives this error: fasta_file = sys.argv[1] # Input fasta file IndexError: list index out of range

Any ideas? Thanks! -e

ADD REPLY • link 13.5 years ago by User 7521 ▴ 10

Ram · Answer 2 · 2010-10-08

As usual, Use BioPerl.

If you are working with fasta files and sequences just more than once, invest the time to install and use the basics of BioPerl. It has been tested for you, it is compact, efficient and smart.

Here an adaptation of a code I wrote using BioPerl for a similar task

usage: perl thisScript.pl fastaFile.fa queryFile.txt

use Bio::DB::Fasta;

my $fastaFile = shift;
my $queryFile = shift;

my $db = Bio::DB::Fasta->new( $fastaFile );
open (IN, $queryFile);
while (<IN>){
    chomp; 
    $seq = $_;
    my $sequence = $db->seq($seq);
    if  (!defined( $sequence )) {
            die "Sequence $seq not found. \n" 
    }   
    print ">$seq\n", "$sequence\n";
}

The first time you use it, it will index the fasta file, but then it will be superfast and will let you fetch all sequences (one ID per line in file)

The good thing is that after indexing, it KNOWS where every sequence is in the file and it will get it without going through the full file.

Ram · Answer 3 · 2010-10-08

Basically the modification is that for many IDs, you don't want to enter them as values on the command line. You would want to store them in a file. The script then reads that file, stores the IDs and looks for them in the FASTA records.

One approach is to store the query IDs as hash keys. You can then use exists() to see if that ID is stored.

Assuming that your IDs are in a file ids.txt with one ID per line, something like this should work:

#!/usr/bin/perl -w

use strict;

my $idsfile = "ids.txt";
my $seqfile = "seqs.fa";
my %ids  = ();

open FILE, $idsfile;
while(<FILE>) {
  chomp;
  $ids{$_} += 1;
}
close FILE;

local $/ = "\n>";  # read by FASTA record

open FASTA, $seqfile;
while (<FASTA>) {
    chomp;
    my $seq = $_;
    my ($id) = $seq =~ /^>*(\S+)/;  # parse ID as first word in FASTA header
    if (exists($ids{$id})) {
        $seq =~ s/^>*.+\n//;  # remove FASTA header
        $seq =~ s/\n//g;  # remove endlines
        print "$seq\n";
    }
}
close FASTA;

It's quite similar to the original, except for the removal of last (since we don't want to break on finding the first match). One thing to note is that if the regex fails to find an ID, the check for its existence as a hash key will fail.

Finally, as noted elsewhere on this site, you should make use of existing libraries for standard tasks such as parsing sequence formats. The relevant Bioperl tool is Bio::SeqIO.

Ram · Answer 4 · 2010-10-08

6

Entering edit mode

14.8 years ago

Rm 8.3k

perl -ne 'BEGIN{ $/=">"; } if(/^s*(S+)/){ open(F,">$1.fsa")||warn"$1 write failed:$!n";chomp;print F ">", $_ }' fastafile

"Split a multi-sequence FASTA file into individual files"

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 14.8 years ago by Rm 8.3k

0

Entering edit mode

blastdbcmd or fastacmd of BLAST suit can take input sequence IDs from a file and will output the corresponding sequences in fasta format from a sequence database.

ADD REPLY • link 14.8 years ago by Rm 8.3k

Ram · Answer 5 · 2010-10-08

5

Entering edit mode

14.8 years ago

biobot 0.0.77.a.1099 6.2k

If your Fasta file really is big, you do not want to be scanning the entire file to fetch 3% of your sequences. Index the file first, then you can use random access to fetch any number of sequences, in any order. There are many indexers, for example Exonerate comes with many useful and quick Fasta utilities, including an indexer and fetcher (written in C).

However, if we're talking Perl, BioPerl has this feature and a tutorial example with code Scan massive files? Just say no.

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 14.8 years ago by biobot 0.0.77.a.1099 6.2k

0

Entering edit mode

Absolutely the correct approach; see this related BioStar question and answers - Extracting Sequence From A 3Gb Fasta File

ADD REPLY • link updated 5.9 years ago by Ram 45k • written 14.8 years ago by Neilfws 49k

score 5 · Answer 6 · 2011-07-22

Looks like nobody has mentioned the blast package? Use 'formatdb' with the -o option to index your fasta then use 'fastacmd' with -i, or -s, -L options to grab the sequences you want. All the perl scripts are probably reinventing the wheel and I'm quite sure all perl solutions would be much slower than 'fastacmd' if you have a big list. They're good practices for programming with bioperl, though.

Ram · Answer 7 · 2012-06-12

I don't know if you wanted a Python answer, but I'm posting this as an alternative to Eric's example using Biopython.

This based on the "Filtering a sequence file" example taken from the Biopython Tutorial and switched to using FASTA files instead of SFF files.

#!/usr/bin/env python

import sys
from Bio import SeqIO

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file = sys.argv[3]

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)

records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
count = SeqIO.write(records, output_file, "fasta")

print "Saved %i records from %s to %s" % (count, input_file, output_file)
if count < len(wanted):
    print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)

This uses a generator expression which is memory efficient - only one record is loaded at a time, allowing this to work on files too big to load into memory at once. As pointed out by Keith James, if you only want a few of the records, it should be faster to index the file and just extract just the entries you want. e.g.

#!/usr/bin/env python

import sys
from Bio import SeqIO

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file = sys.argv[3]

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)

index = SeqIO.index(input_file, "fasta")
records = (index[r] for r in wanted)
count = SeqIO.write(records, output_file, "fasta")
assert count == len(wanted)

print "Saved %i records from %s to %s" % (count, input_file, output_file)

If you intend to use the index more than once, the Bio.SeqIO.index_db(...) function can save the index to disk.