Hello, I am trying to use Bio.Blast to perfrom blast similar to NCBI web-based blast. I got code from previous post and revised some to output the data in excel spreadsheet. I tried to have info of coverage and identity as well, which is why I have parameters of query_length, hsp.identities, and hsp.align_length.
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
fasta_files = ['Prakit_EN117_981R_G05']
for i in fasta_files:
import csv
csvData = [['ID','score','bits','E-score','hsp.identities','hsp.align_length','query_length']]
with open (i+'.csv','wb') as csvFile:
writer = csv.writer(csvFile)
writer.writerows(csvData)
csvFile.close()
fasta_string = open(i+".fasta").read()
query=''.join(fasta_string.split('\n')[1:])
query_length = len(query)
result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string)
with open(i+".xml","w")as out_handle:
out_handle.write(result_handle.read())
result_handle.close()
result_handle = open(i+".xml")
blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
cal_pidentity = float(hsp.identities/hsp.align_length)
import csv
row = [alignment.title,str(hsp.score),str(hsp.bits),str(hsp.expect),str(hsp.identities),str(hsp.align_length),str(query_length)]
with open(i+'.csv','ab') as csvFile:
writer = csv.writer(csvFile)
writer.writerow(row)
csvFile.close()
I successfully obtained the output. However, it raises many questions.
1.) When I further use hsp.identities, hsp.align_lenth, and query_length to manually calculated coverage and percent identity, there was some hit showing >100%coverage.
2.) I dont understand how the Bio.Blast sort the results because in my excel spreadsheet, some hits at the lower rows show higher calculated percent identity than upper rows.
3.) In overall, the results in my excel spreadsheet looks a bit different from results from NCBI web-based, using the same query.
How can I solve this problem?
Thank you so much
can you confirm that you use the same database and param settings as the only NBCI blast?
inserting code is simply done by pasting it in your post, then select it and format it with the icon on top of the edit window:
However if it's large amount of code it might be more useful (advisable) to use something like code-snippets from github gist for instance , some more info here
why ... but why? :-)
Hello lieven.sterck, I just found out that I can export .xml file from NCBI web-based too. Then, I compared details of parameters recorded in .xml files of NCBI web-based and the Bio.Blast I ran. They were different (i.e. match, mismatch, gap-open, gap-extend). Probably, that is reason why it is different?
Hello lieven.sterck 1.) Thank you for the tips, I am looking into it now to revise the post. 2.) I only confirmed that they are the same database and same query but I didnt check other param settings yet ( I actually dont know how to). Could you guide me how to? 3.) I want to output the excel spreadsheet for conveniences in further process. I checked NCBI web-based, they have option to export spreadsheet but the hit title is not included.
Thank you so much
On web blast page any blast algorithm you use has an expandable section towards the bottom of the screen that says
Algorithm Parameters
. Expand that section and make a note of defaults blast is using. You can then use them locally to reproduce web results.While it is fine to export blast results as tab- or comma-limited value data using excel to analyze them is fraught with problems. If someone just needs to "view" that data then that application may be fine.
Hello, Thank you so much. I have one question. It seems like NCBI web-based blast using algorithm based on the article different from the one used in Bio.Blast, in that case, can we change the algorithm to be according to the NCBI web-based?
Blast+
algorithms should be identical (i.e. if you are referring toblastn
).The BioPython cookbook FAQ #19 suggests the above, which is what Lieven was getting at.