I am trying to write a script using Biopython to parse the Hit_def identifier from NCBI tblastn output. I'm not sure how to get the Hit_def into Python. Attached are the XML file and my Python code.
The target ID may be what I want. I have a list of peptides converted from the FASTA reference file, and each has an ID that matches the "Hit_def" element in the XML file output by tblastn. I want to get that value and print it to a text file so the ID can be compared against an Excel spreadsheet containing all the peptides.
Something like this will let you get the query ID and then the target ID (if that is what you want):
from Bio import SearchIO

E_VALUE_THRES = 0.01

# Plain text mode "r" replaces "rU", which Python 3.11 no longer accepts.
with open("conesnail.xml", "r") as handle:
    for qresult in SearchIO.parse(handle, "blast-xml"):
        hits = qresult.hits
        query_id = qresult.id
        if len(hits) > 0:
            # Only the first hit listed for each query, and its first HSP, are checked.
            target_id = hits[0].id
            evalue = hits[0].hsps[0].evalue
            if evalue < E_VALUE_THRES:
                print("%s\t%s" % (query_id, target_id))
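If the goal is specifically the raw Hit_def text written to a file, one alternative is to read the XML directly with the standard library rather than through SearchIO. The following is only a minimal sketch, assuming the usual BLAST XML layout (Iteration elements containing Iteration_query-def and Hit elements containing Hit_def). The function names extract_hit_defs and write_hit_defs are made up for illustration, and it reports every hit without any E-value filtering.

```python
import xml.etree.ElementTree as ET


def extract_hit_defs(xml_path):
    """Yield (query_def, hit_def) pairs from a BLAST XML file.

    Assumes the standard layout: each <Iteration> holds one query
    (<Iteration_query-def>) and zero or more <Hit> elements.
    """
    tree = ET.parse(xml_path)
    for iteration in tree.getroot().iter("Iteration"):
        query_def = iteration.findtext("Iteration_query-def", default="")
        for hit in iteration.iter("Hit"):
            yield query_def, hit.findtext("Hit_def", default="")


def write_hit_defs(xml_path, out_path):
    """Write one 'query<TAB>Hit_def' line per hit and return the line count."""
    count = 0
    with open(out_path, "w") as out:
        for query_def, hit_def in extract_hit_defs(xml_path):
            out.write(f"{query_def}\t{hit_def}\n")
            count += 1
    return count
```

Calling write_hit_defs("conesnail.xml", "hit_defs.txt") would then produce a tab-separated file that can be opened next to the spreadsheet; the output file name here is just a placeholder.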
Written 9.5 years ago by Jon; updated 5.5 years ago by Ram.
Last thing, how would I also parse the protein alignment?
It's great you've provided a sample input file, but what exactly do you want as the output?