I'm looking for a way to do BLAST simultaneously with >1 sequence. The input is several fasta files or one file containing several sequence formatted in fasta. For an initial testing, I want to try obtaining the top 10 blast results for each input sequence and output them to a text file (one text file for each input sequence, or one big file containing all). But currently, I'm drawing a blank. Is there a way to do this with (preferably) biopython?
you can do this quite easily via the commandline. assuming your query and subject sequences are not protein. here's a shell script, just adjust your QUERY and SUBJECT
QUERY=some.fasta SUBJECT=/path/to/other.fasta formatdb -p F -i $SUBJECT blastall -p blastn -i $QUERY -d $SUBJECT -K 10 -m 8 -a 8 > blast.out
the -K 10 keeps only the 10 best hits -m 8 makes the output in tab-delimited format and -a 8 uses 8 processors. so blast.out will contain (up to) 10 best hits for each query.
If you want to do it over the internet using blastn and multiple fasta files it would look something like this:
from Bio.Blast import NCBIWWW from Bio.Blast import NCBIXML file_string = "" x = 1 fasta_files = [fasta1, fasta2, fasta3, fasta4] for i in fasta_files: File = open("output"+x+".txt","w") fasta_string = open(i+".fasta").read() #or make the names fasta1.fasta and just do open(i).read result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string, hitlist_size=10) blast_records = NCBIXML.parse(result_handle) #or blast_record = NCBIXML.read(result_handle) if you only have one seq in file E_VALUE_THRESH = 0.001 for blast_record in blast_records: for alignment in blast_record.alignments: for hsp in alignment.hsps: if hsp.expect < E_VALUE_THRESH: file_string += "alignment:",alignment.title+"\n" file_string += "e-value:",hsp.expect+"\n" x += 1 File.write(file_string)
Not sure if this was what you meant and I haven't tested it so may be some little faults in it, but it should take several different files of fasta sequences and write the results as a file as output1.txt, output2.txt etc. For multiple sequences in one file you can use the blast_records part
Hope this helps.
As far as I remember, the NCBI blastall command takes any number of sequences for its input. In http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node26.html
The query must be in FASTA format. If multiple entries are in the input file, all queries will be searched.
Then, there are many engines/parsers for digesting the BLAST output . On my side, I would use the XML output, insert the results in a database via XSLT and select the 10 best hits.
I used the above commands for local blast and it worked wonderfully, I just needed to replace -K with -b. Thank you. However, I have the following questions about this commands,
"blastall -p blastn -i $QUERY -d $SUBJECT -b 10 -m 8 -a 8 > blast.out"
1. The result does not show any headings for the columns, e-value & score columns r easily identifiable but I don't understand the other results. How can I get the headings?
2. It doesn't give the positives in percent, how to get it?