How Can I Automate Blast Of >1 Sequence And Output The Top Hits For Each Input?
4
12
Entering edit mode
11.2 years ago
bow ▴ 790

I'm looking for a way to do BLAST simultaneously with >1 sequence. The input is several fasta files or one file containing several sequence formatted in fasta. For an initial testing, I want to try obtaining the top 10 blast results for each input sequence and output them to a text file (one text file for each input sequence, or one big file containing all). But currently, I'm drawing a blank. Is there a way to do this with (preferably) biopython?

blast python biopython • 19k views
ADD COMMENT
1
Entering edit mode

There's an easy way to do this with bash, if all you need is the information from the top 1-10 hits.

ADD REPLY
9
Entering edit mode
11.2 years ago
brentp 23k

You can do this quite easily via the commandline. assuming your query and subject sequences are not protein. Here's a shell script, just adjust your QUERY and SUBJECT

QUERY=some.fasta
SUBJECT=/path/to/other.fasta

formatdb -p F -i $SUBJECT
blastall -p blastn -i $QUERY -d $SUBJECT -K 10 -m 8 -a 8 > blast.out

The -K 10 keeps only the 10 best hits, -m 8 makes the output in tab-delimited format, and -a 8 uses 8 processors. so blast.out will contain (up to) 10 best hits for each query.

ADD COMMENT
1
Entering edit mode

Ah, found the problem! It should be '-b' and '-K' :).

ADD REPLY
0
Entering edit mode

I tried this one too for a local search. It worked fine, except for the '-K' switch. I tried various values (1,5,10), but it keeps giving me plenty (~ >50) of results. Any pointers on where I might be doing wrong?

ADD REPLY
0
Entering edit mode

Hello bow and brentp,

I used the above commands for local blast and it worked wonderfully, I just needed to replace -K with -b. Thank you. However, I have the following questions about this commands,

blastall -p blastn -i $QUERY -d $SUBJECT -b 10 -m 8 -a 8 > blast.out
  1. The result does not show any headings for the columns, e-value & score columns r easily identifiable but I don't understand the other results. How can I get the headings?
  2. It doesn't give the positives in percent, how to get it?
ADD REPLY
6
Entering edit mode
11.2 years ago
Niek De Klein ★ 2.6k

If you want to do it over the internet using blastn and multiple fasta files it would look something like this:

from Bio.Blast import  NCBIWWW
from Bio.Blast import NCBIXML
file_string = ""    
x = 1

fasta_files = [fasta1, fasta2, fasta3, fasta4]
for i in fasta_files:
    File = open("output"+x+".txt","w")
    fasta_string = open(i+".fasta").read() #or make the names fasta1.fasta and just do open(i).read
    result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string, hitlist_size=10)

    blast_records = NCBIXML.parse(result_handle) 
    #or blast_record = NCBIXML.read(result_handle) if you only have one seq in file
    E_VALUE_THRESH = 0.001
    for blast_record in blast_records:
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < E_VALUE_THRESH:
                   file_string += "alignment:",alignment.title+"\n"
                   file_string += "e-value:",hsp.expect+"\n"
    x += 1
    File.write(file_string)

Not sure if this was what you meant and I haven't tested it so may be some little faults in it, but it should take several different files of fasta sequences and write the results as a file as output1.txt, output2.txt etc. For multiple sequences in one file you can use the blast_records part

Hope this helps.

ADD COMMENT
0
Entering edit mode

Thanks! I changed it a bit (File = open("output"+str(x)+".txt","w")) and it works! I'll try tweaking on my own to make it accept a single file with multiple fasta sequence in it.

ADD REPLY
0
Entering edit mode

Hi, where you able to make it accept a single file with multiple fasta sequences in it? I really need help with this, thanks

ADD REPLY
0
Entering edit mode

I suggest using the "7.3 Parsing BLAST output" chapter of biopython tutorial.

ADD REPLY
0
Entering edit mode

Hi! I was wondering, will this work using protein sequences just by changing 'blastn' to 'blastp'?

ADD REPLY
2
Entering edit mode
11.2 years ago

As far as I remember, the NCBI blastall command takes any number of sequences for its input. In http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node26.html

The query must be in FASTA format. If multiple entries are in the input file, all queries will be searched.

Then, there are many engines/parsers for digesting the BLAST output . On my side, I would use the XML output, insert the results in a database via XSLT and select the 10 best hits.

ADD COMMENT
0
Entering edit mode

update: yes Paulo is right. There is an easier solution using bash and the tab-delimited output.

ADD REPLY
0
Entering edit mode

I'm not yet familiar with XML or even XSLT, but thanks! It looks like I won't be running out of self-learning materials for the near future :).

ADD REPLY
0
Entering edit mode
5.9 years ago
mirza ▴ 140

Hello bow and brentp,

I used the above commands for local blast and it worked wonderfully, I just needed to replace -K with -b. Thank you. However, I have the following questions about this commands,

blastall -p blastn -i $QUERY -d $SUBJECT -b 10 -m 8 -a 8 > blast.out
  1. The result does not show any headings for the columns, e-value & score columns r easily identifiable but I don't understand the other results. How can I get the headings?
  2. It doesn't give the positives in percent, how to get it?
ADD COMMENT
0
Entering edit mode

@mirzabiotech2000:

  1. Column names can be found here for the outfmt=6: http://www.drive5.com/usearch/manual/blast6out.html More info: http://www.ncbi.nlm.nih.gov/books/NBK279675/

  2. In outfmt 6 you can see the identity/accuracy as a percentage in field 3. Is this what you are after?

ADD REPLY
0
Entering edit mode

thank you amblina for your reply. I will go through the link.

ADD REPLY

Login before adding your answer.

Traffic: 1631 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6