Extracting Multiple Fasta Sequences At A Time From A File Containing Many Sequences
7
11
Entering edit mode
11.1 years ago
Sarah ▴ 110

I know a similar question was asked previously, but the "vanilla" perl code (which works beautifully if you want to extract ONE sequence, doesn't extract what I need (about 100 fasta sequences by ID from a big file containing 3000 total FASTA files concatenated in one file). Can anyone help? Perhaps it is a simple modification? Here is the code that was posted, I just don't know how to change it so I can input more than one ID (t currently fails if you put more than one). Many thanks to you knowledgeable folks out there!

(usage: extractSeqByID.pl SEQ123 < huge.fsa > my.fsa)

use warnings;
use strict;

my $lookup = shift @ARGV; # ID to extract local$/ = "\n>";  # read by FASTA record

while (my $seq = <>) { chomp$seq;
my ($id) =$seq =~ /^>*(\S+)/;  # parse ID as first word in FASTA header
if ($id eq$lookup) {
$seq =~ s/^>*.+\n//; # remove FASTA header$seq =~ s/\n//g;  # remove endlines
print "$seq\n"; last; } }  fasta sequence parsing perl • 51k views ADD COMMENT 0 Entering edit mode Hi everyone. i got a fasta file of 10 sequences. now i wanna write a python code that reads each sequence and parse it to RNAfold.exe for a secondary structure prediction. then, for each structure, print out some features like for example, number of G:U wobble pairs,CpG occurence and G+C content. can anybody help me out?i need your help becoz i am damn soo "good" at programming. thnaks a lot ADD REPLY 0 Entering edit mode If you have really MANY sequences (thousands or millions) it's worth to put them into database. I recommend SQLite (http://www.sqlite.org/) - query is extremely rapid, db is stored in just one file, and there are bindings for many languages (very good for Python for instance). ADD REPLY 17 Entering edit mode 11.1 years ago Hi Sarah, I am learning Perl myself and cannot directly answer your question (although I'll be happy to learn the solution too). However, here is some code in Python, requiring Biopython, that will do the trick: #!/usr/bin/env python # -*- coding: utf-8 -*- import sys from Bio import SeqIO fasta_file = sys.argv[1] # Input fasta file number_file = sys.argv[2] # Input interesting numbers file, one per line result_file = sys.argv[3] # Output fasta file wanted = set() with open(number_file) as f: for line in f: line = line.strip() if line != "": wanted.add(line) fasta_sequences = SeqIO.parse(open(fasta_file),'fasta') end = False with open(result_file, "w") as f: for seq in fasta_sequences: if seq.id in wanted: SeqIO.write([seq], f, "fasta")  Save this to a file, make it executable, and call it in this way: ./script.py input.fasta name_list.txt output.fasta  The name_list.txt file should contain one ID per line. The ID should match exactly to the sequence name, although this can be modified easily. The input.fasta file can be of any size. Cheers! ADD COMMENT 1 Entering edit mode +1 for Biopython :) in addition, you can load ids from file using just ids= set( x.strip() for x in open(idfile) ) and you don't need to import re ADD REPLY 1 Entering edit mode Hi Eric. You are forgetting to close your file handles. Also, making multiple calls to SeqIO.write() is not very efficient, and won't work on more complex file formats. A single call is recommend, using an iterator approach, as in the "Filtering a sequence file" example in the Tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html which I will add as an answer separately below. ADD REPLY 0 Entering edit mode Hi Leszek. I removed the 'import re' bit since it was useless anyway, just a carry over from a previous code. Cheers! ADD REPLY 0 Entering edit mode Hi Eric,is it possible to match the sequence ids in the wanted file with first 15 letters of each Id in input fasta file. I mean just matching a part of ids not whole. Thanks in advance, ADD REPLY 0 Entering edit mode Hi @Simran. It sure is possible. Replace 'wanted.add(line)' by 'wanted.add(line[0:15])' and also replace 'if seq.id in wanted:' by 'if seq.id[0:15] in wanted:' and it should do the trick. Cheers ADD REPLY 0 Entering edit mode Thanks for posting Biopython code - just learning now. I tried this but running the .py file says I dont have permission (w/ the ./) before it, and when I type in the code it gives this error: fasta_file = sys.argv[1] # Input fasta file IndexError: list index out of range Any ideas? Thanks! -e ADD REPLY 17 Entering edit mode 11.0 years ago As usual, Use BioPerl. If you are working with fasta files and sequences just more than once, invest the time to install and use the basics of BioPerl. It has been tested for you, it is compact, efficient and smart. Here an adaptation of a code I wrote using BioPerl for a similar task usage: perl thisScript.pl fastaFile.fa queryFile.txt use Bio::DB::Fasta; my$fastaFile = shift;
my $queryFile = shift; my$db = Bio::DB::Fasta->new( $fastaFile ); open (IN,$queryFile);
while (<IN>){
chomp;
$seq =$_;
my $sequence =$db->seq($seq); if (!defined($sequence )) {
die "Sequence $seq not found. \n" } print ">$seq\n", "$sequence\n"; }  The first time you use it, it will index the fasta file, but then it will be superfast and will let you fetch all sequences (one ID per line in file) The good thing is that after indexing, it KNOWS where every sequence is in the file and it will get it without going through the full file. ADD COMMENT 8 Entering edit mode 11.1 years ago Neilfws 49k Basically the modification is that for many IDs, you don't want to enter them as values on the command line. You would want to store them in a file. The script then reads that file, stores the IDs and looks for them in the FASTA records. One approach is to store the query IDs as hash keys. You can then use exists() to see if that ID is stored. Assuming that your IDs are in a file ids.txt with one ID per line, something like this should work: #!/usr/bin/perl -w use strict; my$idsfile = "ids.txt";
my $seqfile = "seqs.fa"; my %ids = (); open FILE,$idsfile;
while(<FILE>) {
chomp;
$ids{$_} += 1;
}
close FILE;

local $/ = "\n>"; # read by FASTA record open FASTA,$seqfile;
while (<FASTA>) {
chomp;
my $seq =$_;
my ($id) =$seq =~ /^>*(\S+)/;  # parse ID as first word in FASTA header
if (exists($ids{$id})) {
$seq =~ s/^>*.+\n//; # remove FASTA header$seq =~ s/\n//g;  # remove endlines
print "$seq\n"; } } close FASTA;  It's quite similar to the original, except for the removal of last (since we don't want to break on finding the first match). One thing to note is that if the regex fails to find an ID, the check for its existence as a hash key will fail. Finally, as noted elsewhere on this site, you should make use of existing libraries for standard tasks such as parsing sequence formats. The relevant Bioperl tool is Bio::SeqIO. ADD COMMENT 6 Entering edit mode 11.0 years ago Rm 8.1k perl -ne 'BEGIN{$/=">"; } if(/^s*(S+)/){ open(F,">$1.fsa")||warn"$1 write failed:$!n";chomp;print F ">",$_ }' fastafile


"Split a multi-sequence FASTA file into individual files"

0
Entering edit mode

blastdbcmd or fastacmd of BLAST suit can take input sequence IDs from a file and will output the corresponding sequences in fasta format from a sequence database.

5
Entering edit mode
11.0 years ago

If your Fasta file really is big, you do not want to be scanning the entire file to fetch 3% of your sequences. Index the file first, then you can use random access to fetch any number of sequences, in any order. There are many indexers, for example Exonerate comes with many useful and quick Fasta utilities, including an indexer and fetcher (written in C).

However, if we're talking Perl, BioPerl has this feature and a tutorial example with code Scan massive files? Just say no.

0
Entering edit mode

Absolutely the correct approach; see this related BioStar question and answers - Extracting Sequence From A 3Gb Fasta File

5
Entering edit mode
10.3 years ago
Vitis ★ 2.5k

Looks like nobody has mentioned the blast package? Use 'formatdb' with the -o option to index your fasta then use 'fastacmd' with -i, or -s, -L options to grab the sequences you want. All the perl scripts are probably reinventing the wheel and I'm quite sure all perl solutions would be much slower than 'fastacmd' if you have a big list. They're good practices for programming with bioperl, though.

5
Entering edit mode
9.4 years ago
Peter 6.0k

I don't know if you wanted a Python answer, but I'm posting this as an alternative to Eric's example using Biopython.

This based on the "Filtering a sequence file" example taken from the Biopython Tutorial and switched to using FASTA files instead of SFF files.

#!/usr/bin/env python

import sys
from Bio import SeqIO

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file = sys.argv[3]

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)

records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
count = SeqIO.write(records, output_file, "fasta")

print "Saved %i records from %s to %s" % (count, input_file, output_file)
if count < len(wanted):


This uses a generator expression which is memory efficient - only one record is loaded at a time, allowing this to work on files too big to load into memory at once. As pointed out by Keith James, if you only want a few of the records, it should be faster to index the file and just extract just the entries you want. e.g.

#!/usr/bin/env python

import sys
from Bio import SeqIO

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file = sys.argv[3]

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)

index = SeqIO.index(input_file, "fasta")
records = (index[r] for r in wanted)
count = SeqIO.write(records, output_file, "fasta")
assert count == len(wanted)

print "Saved %i records from %s to %s" % (count, input_file, output_file)


If you intend to use the index more than once, the Bio.SeqIO.index_db(...) function can save the index to disk.