Question

How Can I Automate Blast Of >1 Sequence And Output The Top Hits For Each Input?

12

Entering edit mode

14.1 years ago

bow ▴ 790

I'm looking for a way to do BLAST simultaneously with >1 sequence. The input is several fasta files or one file containing several sequence formatted in fasta. For an initial testing, I want to try obtaining the top 10 blast results for each input sequence and output them to a text file (one text file for each input sequence, or one big file containing all). But currently, I'm drawing a blank. Is there a way to do this with (preferably) biopython?

blast python biopython • 23k views

ADD COMMENT • link updated 8.8 years ago by mirza ▴ 180 • written 14.1 years ago by bow ▴ 790

1

Entering edit mode

There's an easy way to do this with bash, if all you need is the information from the top 1-10 hits.

ADD REPLY • link 14.1 years ago by Paulo Nuin ★ 3.7k

Ram · Answer 1 · 2010-09-15

9

Entering edit mode

14.1 years ago

brentp 24k

You can do this quite easily via the commandline. assuming your query and subject sequences are not protein. Here's a shell script, just adjust your QUERY and SUBJECT

QUERY=some.fasta
SUBJECT=/path/to/other.fasta

formatdb -p F -i $SUBJECT
blastall -p blastn -i $QUERY -d $SUBJECT -K 10 -m 8 -a 8 > blast.out

The -K 10 keeps only the 10 best hits, -m 8 makes the output in tab-delimited format, and -a 8 uses 8 processors. so blast.out will contain (up to) 10 best hits for each query.

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 14.1 years ago by brentp 24k

1

Entering edit mode

Ah, found the problem! It should be '-b' and '-K' :).

ADD REPLY • link 14.1 years ago by bow ▴ 790

0

Entering edit mode

I tried this one too for a local search. It worked fine, except for the '-K' switch. I tried various values (1,5,10), but it keeps giving me plenty (~ >50) of results. Any pointers on where I might be doing wrong?

ADD REPLY • link 14.1 years ago by bow ▴ 790

0

Entering edit mode

Hello bow and brentp,

I used the above commands for local blast and it worked wonderfully, I just needed to replace -K with -b. Thank you. However, I have the following questions about this commands,

blastall -p blastn -i $QUERY -d $SUBJECT -b 10 -m 8 -a 8 > blast.out

The result does not show any headings for the columns, e-value & score columns r easily identifiable but I don't understand the other results. How can I get the headings?
It doesn't give the positives in percent, how to get it?

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by mirza ▴ 180

Ram · Answer 2 · 2010-09-15

6

Entering edit mode

14.1 years ago

Niek De Klein ★ 2.6k

If you want to do it over the internet using blastn and multiple fasta files it would look something like this:

from Bio.Blast import  NCBIWWW
from Bio.Blast import NCBIXML
file_string = ""    
x = 1

fasta_files = [fasta1, fasta2, fasta3, fasta4]
for i in fasta_files:
    File = open("output"+x+".txt","w")
    fasta_string = open(i+".fasta").read() #or make the names fasta1.fasta and just do open(i).read
    result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string, hitlist_size=10)

    blast_records = NCBIXML.parse(result_handle) 
    #or blast_record = NCBIXML.read(result_handle) if you only have one seq in file
    E_VALUE_THRESH = 0.001
    for blast_record in blast_records:
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < E_VALUE_THRESH:
                   file_string += "alignment:",alignment.title+"\n"
                   file_string += "e-value:",hsp.expect+"\n"
    x += 1
    File.write(file_string)

Not sure if this was what you meant and I haven't tested it so may be some little faults in it, but it should take several different files of fasta sequences and write the results as a file as output1.txt, output2.txt etc. For multiple sequences in one file you can use the blast_records part

Hope this helps.

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 14.1 years ago by Niek De Klein ★ 2.6k

0

Entering edit mode

Thanks! I changed it a bit (File = open("output"+str(x)+".txt","w")) and it works! I'll try tweaking on my own to make it accept a single file with multiple fasta sequence in it.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 14.1 years ago by bow ▴ 790

0

Entering edit mode

Hi, where you able to make it accept a single file with multiple fasta sequences in it? I really need help with this, thanks

ADD REPLY • link 12.0 years ago by Powerpuffgirl • 0

0

Entering edit mode

I suggest using the "7.3 Parsing BLAST output" chapter of biopython tutorial.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 14.1 years ago by Niek De Klein ★ 2.6k

0

Entering edit mode

Hi! I was wondering, will this work using protein sequences just by changing 'blastn' to 'blastp'?

ADD REPLY • link 8.5 years ago by chups519 ▴ 10

Ram · Answer 3 · 2010-09-15

2

Entering edit mode

14.1 years ago

Pierre Lindenbaum 164k

As far as I remember, the NCBI blastall command takes any number of sequences for its input. In http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node26.html

The query must be in FASTA format. If multiple entries are in the input file, all queries will be searched.

Then, there are many engines/parsers for digesting the BLAST output . On my side, I would use the XML output, insert the results in a database via XSLT and select the 10 best hits.

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 14.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

update: yes Paulo is right. There is an easier solution using bash and the tab-delimited output.

ADD REPLY • link 14.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I'm not yet familiar with XML or even XSLT, but thanks! It looks like I won't be running out of self-learning materials for the near future :).

ADD REPLY • link 14.1 years ago by bow ▴ 790

Ram · Answer 4 · 2016-01-16

0

Entering edit mode

8.8 years ago

mirza ▴ 180

Hello bow and brentp,

I used the above commands for local blast and it worked wonderfully, I just needed to replace -K with -b. Thank you. However, I have the following questions about this commands,

blastall -p blastn -i $QUERY -d $SUBJECT -b 10 -m 8 -a 8 > blast.out

The result does not show any headings for the columns, e-value & score columns r easily identifiable but I don't understand the other results. How can I get the headings?
It doesn't give the positives in percent, how to get it?

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by mirza ▴ 180

0

Entering edit mode

@mirzabiotech2000:

Column names can be found here for the outfmt=6: http://www.drive5.com/usearch/manual/blast6out.html More info: http://www.ncbi.nlm.nih.gov/books/NBK279675/
In outfmt 6 you can see the identity/accuracy as a percentage in field 3. Is this what you are after?