How Can I Automate Blast Of >1 Sequence And Output The Top Hits For Each Input?
4
12
Entering edit mode
11.2 years ago
bow ▴ 790

I'm looking for a way to do BLAST simultaneously with >1 sequence. The input is several fasta files or one file containing several sequence formatted in fasta. For an initial testing, I want to try obtaining the top 10 blast results for each input sequence and output them to a text file (one text file for each input sequence, or one big file containing all). But currently, I'm drawing a blank. Is there a way to do this with (preferably) biopython?

blast python biopython • 19k views
1
Entering edit mode

There's an easy way to do this with bash, if all you need is the information from the top 1-10 hits.

9
Entering edit mode
11.2 years ago
brentp 23k

You can do this quite easily via the commandline. assuming your query and subject sequences are not protein. Here's a shell script, just adjust your QUERY and SUBJECT

QUERY=some.fasta
SUBJECT=/path/to/other.fasta

formatdb -p F -i $SUBJECT blastall -p blastn -i$QUERY -d $SUBJECT -K 10 -m 8 -a 8 > blast.out  The -K 10 keeps only the 10 best hits, -m 8 makes the output in tab-delimited format, and -a 8 uses 8 processors. so blast.out will contain (up to) 10 best hits for each query. ADD COMMENT 1 Entering edit mode Ah, found the problem! It should be '-b' and '-K' :). ADD REPLY 0 Entering edit mode I tried this one too for a local search. It worked fine, except for the '-K' switch. I tried various values (1,5,10), but it keeps giving me plenty (~ >50) of results. Any pointers on where I might be doing wrong? ADD REPLY 0 Entering edit mode Hello bow and brentp, I used the above commands for local blast and it worked wonderfully, I just needed to replace -K with -b. Thank you. However, I have the following questions about this commands, blastall -p blastn -i$QUERY -d SUBJECT -b 10 -m 8 -a 8 > blast.out  1. The result does not show any headings for the columns, e-value & score columns r easily identifiable but I don't understand the other results. How can I get the headings? 2. It doesn't give the positives in percent, how to get it? ADD REPLY 6 Entering edit mode 11.2 years ago Niek De Klein ★ 2.6k If you want to do it over the internet using blastn and multiple fasta files it would look something like this: from Bio.Blast import NCBIWWW from Bio.Blast import NCBIXML file_string = "" x = 1 fasta_files = [fasta1, fasta2, fasta3, fasta4] for i in fasta_files: File = open("output"+x+".txt","w") fasta_string = open(i+".fasta").read() #or make the names fasta1.fasta and just do open(i).read result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string, hitlist_size=10) blast_records = NCBIXML.parse(result_handle) #or blast_record = NCBIXML.read(result_handle) if you only have one seq in file E_VALUE_THRESH = 0.001 for blast_record in blast_records: for alignment in blast_record.alignments: for hsp in alignment.hsps: if hsp.expect < E_VALUE_THRESH: file_string += "alignment:",alignment.title+"\n" file_string += "e-value:",hsp.expect+"\n" x += 1 File.write(file_string)  Not sure if this was what you meant and I haven't tested it so may be some little faults in it, but it should take several different files of fasta sequences and write the results as a file as output1.txt, output2.txt etc. For multiple sequences in one file you can use the blast_records part Hope this helps. ADD COMMENT 0 Entering edit mode Thanks! I changed it a bit (File = open("output"+str(x)+".txt","w")) and it works! I'll try tweaking on my own to make it accept a single file with multiple fasta sequence in it. ADD REPLY 0 Entering edit mode Hi, where you able to make it accept a single file with multiple fasta sequences in it? I really need help with this, thanks ADD REPLY 0 Entering edit mode I suggest using the "7.3 Parsing BLAST output" chapter of biopython tutorial. ADD REPLY 0 Entering edit mode Hi! I was wondering, will this work using protein sequences just by changing 'blastn' to 'blastp'? ADD REPLY 2 Entering edit mode 11.2 years ago As far as I remember, the NCBI blastall command takes any number of sequences for its input. In http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node26.html The query must be in FASTA format. If multiple entries are in the input file, all queries will be searched. Then, there are many engines/parsers for digesting the BLAST output . On my side, I would use the XML output, insert the results in a database via XSLT and select the 10 best hits. ADD COMMENT 0 Entering edit mode update: yes Paulo is right. There is an easier solution using bash and the tab-delimited output. ADD REPLY 0 Entering edit mode I'm not yet familiar with XML or even XSLT, but thanks! It looks like I won't be running out of self-learning materials for the near future :). ADD REPLY 0 Entering edit mode 5.9 years ago mirza ▴ 140 Hello bow and brentp, I used the above commands for local blast and it worked wonderfully, I just needed to replace -K with -b. Thank you. However, I have the following questions about this commands, blastall -p blastn -iQUERY -d \$SUBJECT -b 10 -m 8 -a 8 > blast.out

1. The result does not show any headings for the columns, e-value & score columns r easily identifiable but I don't understand the other results. How can I get the headings?
2. It doesn't give the positives in percent, how to get it?
0
Entering edit mode

@mirzabiotech2000:

1. Column names can be found here for the outfmt=6: http://www.drive5.com/usearch/manual/blast6out.html More info: http://www.ncbi.nlm.nih.gov/books/NBK279675/

2. In outfmt 6 you can see the identity/accuracy as a percentage in field 3. Is this what you are after?

0
Entering edit mode