Question: Extracting Multiple Fasta Sequences At A Time From A File Containing Many Sequences
6
gravatar for Sarah
3.5 years ago by
Sarah60
Sarah60 wrote:

I know a similar question was asked previously, but the "vanilla" perl code (which works beautifully if you want to extract ONE sequence, doesn't extract what I need (about 100 fasta sequences by ID from a big file containing 3000 total FASTA files concatenated in one file). Can anyone help? Perhaps it is a simple modification? Here is the code that was posted, I just don't know how to change it so I can input more than one ID (t currently fails if you put more than one). Many thanks to you knowledgeable folks out there!

(usage: extractSeqByID.pl SEQ123 < huge.fsa > my.fsa)

use warnings;
use strict;

my $lookup = shift @ARGV;  # ID to extract

local $/ = "\n>";  # read by FASTA record

while (my $seq = <>) {
    chomp $seq;
    my ($id) = $seq =~ /^>*(\S+)/;  # parse ID as first word in FASTA header
    if ($id eq $lookup) {
        $seq =~ s/^>*.+\n//;  # remove FASTA header
        $seq =~ s/\n//g;  # remove endlines
        print "$seq\n";
        last;
    }
}
ADD COMMENTlink modified 3.5 years ago by Peter3.7k • written 3.5 years ago by Sarah60

If you have really MANY sequences (thousands or millions) it's worth to put them into database. I recommend SQLite (http://www.sqlite.org/) - query is extremely rapid, db is stored in just one file, and there are bindings for many languages (very good for Python for instance).

ADD REPLYlink written 22 months ago by Leszek2.9k
11
gravatar for Stefano Berri
3.5 years ago by
Stefano Berri3.2k
Cambridge
Stefano Berri3.2k wrote:

As usual, Use BioPerl.

if you are working with fasta files and sequences just more than once, invest the time to install and use the basics of BioPerl. It has been tested for you, it is compact, efficient and smart.

here an adaptation of a code I wrote using BioPerl for a similar task

usage: perl thisScript.pl fastaFile.fa queryFile.txt

use Bio::DB::Fasta;

my $fastaFile = shift;
my $queryFile = shift;

my $db = Bio::DB::Fasta->new( $fastaFile );
open (IN, $queryFile);
while (<IN>){
    chomp; 
    $seq = $_;
    my $sequence = $db->seq($seq);
    if  (!defined( $sequence )) {
            die "Sequence $seq not found. \n" 
    }   
    print ">$seq\n", "$sequence\n";
}

The first time you use it, it will index the fasta file, but then it will be superfast and will let you fetch all sequences (one ID per line in file )

The good thing is that after indexing, it KNOWS where every sequence is in the file and it will get it without going through the full file.

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Stefano Berri3.2k
7
gravatar for Neilfws
3.5 years ago by
Neilfws41k
Sydney, Australia
Neilfws41k wrote:

Basically the modification is that for many IDs, you don't want to enter them as values on the command line. You would want to store them in a file. The script then reads that file, stores the IDs and looks for them in the FASTA records.

One approach is to store the query IDs as hash keys. You can then use exists() to see if that ID is stored.

Assuming that your IDs are in a file "ids.txt" with one ID per line, something like this should work:

#!/usr/bin/perl -w

use strict;

my $idsfile = "ids.txt";
my $seqfile = "seqs.fa";
my %ids  = ();

open FILE, $idsfile;
while(<FILE>) {
  chomp;
  $ids{$_} += 1;
}
close FILE;

local $/ = "\n>";  # read by FASTA record

open FASTA, $seqfile;
while (<FASTA>) {
    chomp;
    my $seq = $_;
    my ($id) = $seq =~ /^>*(\S+)/;  # parse ID as first word in FASTA header
    if (exists($ids{$id})) {
        $seq =~ s/^>*.+\n//;  # remove FASTA header
        $seq =~ s/\n//g;  # remove endlines
        print "$seq\n";
    }
}
close FASTA;

It's quite similar to the original, except for the removal of last (since we don't want to break on finding the first match). One thing to note is that if the regex fails to find an ID, the check for its existence as a hash key will fail.

Finally, as noted elsewhere on this site, you should make use of existing libraries for standard tasks such as parsing sequence formats. The relevant Bioperl tool is Bio::SeqIO.

ADD COMMENTlink written 3.5 years ago by Neilfws41k
4
gravatar for Eric Normandeau
2.7 years ago by
Eric Normandeau7.1k
Quebec, Canada
Eric Normandeau7.1k wrote:

HI Sarah,

I am learning Perl myself and cannot directly answer your question (although I'll be happy to learn the solution too). However, here is some code in Python, requiring Biopython, that will do the trick:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]  # Input fasta file
number_file = sys.argv[2] # Input interesting numbers file, one per line
result_file = sys.argv[3] # Output fasta file

wanted = set()
with open(number_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)
fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
end = False
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")

Save this to a file, make it executable, and call it in this way:

./script.py input.fasta name_list.txt output.fasta

The name_list.txt file should contain one ID per line. The ID should match exactly to the sequence name, although this can be modified easily. The input.fasta file can be of any size.

Cheers!

ADD COMMENTlink modified 2.7 years ago • written 3.5 years ago by Eric Normandeau7.1k

+1 for Biopython :) in addition, you can load ids from file using just ids= set( x.strip() for x in open(idfile) ) and you don't need to import re

ADD REPLYlink written 2.7 years ago by Leszek2.9k

Hi Leszek. I removed the 'import re' bit since it was useless anyway, just a carry over from a previous code. Cheers!

ADD REPLYlink written 2.7 years ago by Eric Normandeau7.1k

Hi Eric,is it possible to match the sequence ids in the wanted file with first 15 letters of each Id in input fasta file. I mean just matching a part of ids not whole. Thanks in advance,

ADD REPLYlink written 2.5 years ago by Simran30

Hi @Simran. It sure is possible. Replace 'wanted.add(line)' by 'wanted.add(line[0:15])' and also replace 'if seq.id in wanted:' by 'if seq.id[0:15] in wanted:' and it should do the trick. Cheers

ADD REPLYlink written 2.5 years ago by Eric Normandeau7.1k

Thanks for posting Biopython code - just learning now. I tried this but running the .py file says I dont have permission (w/ the ./) before it, and when I type in the code it gives this error: fasta_file = sys.argv[1] # Input fasta file IndexError: list index out of range

Any ideas? Thanks! -e

ADD REPLYlink written 2.3 years ago by User 75210

Hi Eric. You are forgetting to close your file handles. Also, making multiple calls to SeqIO.write() is not very efficient, and won't work on more complex file formats. A single call is recommend, using an iterator approach, as in the "Filtering a sequence file" example in the Tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html which I will add as an answer separately below.

ADD REPLYlink written 22 months ago by Peter3.7k
4
gravatar for Keith James
3.5 years ago by
Keith James5.4k
UK
Keith James5.4k wrote:

If your Fasta file really is big, you do not want to be scanning the entire file to fetch 3% of your sequences. Index the file first, then you can use random access to fetch any number of sequences, in any order. There are many indexers, for example Exonerate comes with many useful and quick Fasta utilities, including an indexer and fetcher (written in C).

However, if we're talking Perl, BioPerl has this feature and a tutorial example with code Scan massive files? Just say no.

ADD COMMENTlink written 3.5 years ago by Keith James5.4k

Absolutely the correct approach; see this related BioStar question and answers - http://biostar.stackexchange.com/questions/1196/extracting-sequence-from-a-3gb-fasta-file

ADD REPLYlink written 3.5 years ago by Neilfws41k
3
gravatar for Vitis
2.7 years ago by
Vitis1.3k
New York
Vitis1.3k wrote:

Looks like nobody has mentioned the blast package? Use 'formatdb' with the -o option to index your fasta then use 'fastacmd' with -i, or -s, -L options to grab the sequences you want. All the perl scripts are probably reinventing the wheel and I'm quite sure all perl solutions would be much slower than 'fastacmd' if you have a big list. They're good practices for programming with bioperl, though.

ADD COMMENTlink written 2.7 years ago by Vitis1.3k
2
gravatar for Rm
3.5 years ago by
Rm6.5k
US
Rm6.5k wrote:

perl -ne 'BEGIN{ $/=">"; } if(/^s*(S+)/){ open(F,">$1.fsa")||warn"$1 write failed:$!n";chomp;print F ">", $_ }' fastafile

"Split a multi-sequence FASTA file into individual files"

ADD COMMENTlink written 3.5 years ago by Rm6.5k

blastdbcmd or fastacmd of BLAST suit can take input sequence IDs from a file and will output the corresponding sequences in fasta format from a sequence database.

ADD REPLYlink written 3.5 years ago by Rm6.5k
2
gravatar for Peter
22 months ago by
Peter3.7k
Scotland, UK
Peter3.7k wrote:

I don't know if you wanted a Python answer, but I'm posting this as an alternative to Eric's example using Biopython.

This based on the "Filtering a sequence file" example taken from the Biopython Tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html and switched to using FASTA files instead of SFF files.

#!/usr/bin/env python

import sys
from Bio import SeqIO

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file = sys.argv[3]

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)

records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
count = SeqIO.write(records, output_file, "fasta")

print "Saved %i records from %s to %s" % (count, input_file, output_file)
if count < len(wanted):
    print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)

This uses a generator expression which is memory efficient - only one record is loaded at a time, allowing this to work on files too big to load into memory at once. As pointed out by Keith James, if you only want a few of the records, it should be faster to index the file and just extract just the entries you want. e.g.

#!/usr/bin/env python

import sys
from Bio import SeqIO

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file = sys.argv[3]

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)

index = SeqIO.index(input_file, "fasta")
records = (index[r] for r in wanted)
count = SeqIO.write(records, output_file, "fasta")
assert count == len(wanted)

print "Saved %i records from %s to %s" % (count, input_file, output_file)

If you intend to use the index more than once, the Bio.SeqIO.index_db(...) function can save the index to disk.

ADD COMMENTlink written 22 months ago by Peter3.7k
0
gravatar for User 2925
2.0 years ago by
User 292510
User 292510 wrote:

Hi everyone.

i got a fasta file of 10 sequences. now i wanna write a python code that reads each sequence and parse it to RNAfold.exe for a secondary structure prediction. then, for each structure, print out some features like for example, number of G:U wobble pairs,CpG occurence and G+C content. can anybody help me out?i need your help becoz i am damn soo "good" at programming. thnaks a lot

ADD COMMENTlink written 2.0 years ago by User 292510
Please log in to add an answer.

Help
Access
  • RSS
  • Stats
  • API

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.0.0
Traffic: 280 users visited in the last hour