I ran a BLAST search on a FASTA file containing multiple sequences, using Python (Biopython). Now I want to extract information from the results and put it in a pandas dataframe: the query ID, the hit ID, and the accession number of each hit. Here's what I've done so far:
import pandas as pd
from Bio import SearchIO
from Bio.Blast import NCBIWWW

# Submit every sequence in the FASTA file to NCBI BLAST (blastx, human RefSeq proteins only)
with open("list.fasta") as fasta_handle:
    fasta_string = fasta_handle.read()
result_handle = NCBIWWW.qblast("blastx", sequence=fasta_string, database="refseq_protein",
                               entrez_query="txid9606[ORGN]")

# Save the XML results so they can be re-parsed without repeating the search
with open("my_blast.xml", "w") as out_handle:
    out_handle.write(result_handle.read())
result_handle.close()

# Parse the saved results into a dict that maps query ID -> QueryResult
qresults = SearchIO.parse("my_blast.xml", "blast-xml")
search_dict = SearchIO.to_dict(qresults)

query_id = []
hit_list = []
for key, value in search_dict.items():
    query_id.append(key)
    hit_list.append(value)

# Create the dataframe once the query IDs have been collected
tsv_output = pd.DataFrame({"query_id": query_id})
I have already added the query IDs to the pandas dataframe. Now I'm looking for a way to extract the ID of every result in hit_list, which is a list of QueryResult objects. I've looked through the documentation (https://biopython.org/DIST/docs/api/Bio.SearchIO._model.query.QueryResult-class.html), but I don't see any way to extract the hit ID or the hit accession number. Does anyone know how I could do this?
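To make the goal concrete, here is a rough sketch of the table I am trying to build: one row per query/hit pair. It rests on assumptions I have not been able to confirm from that documentation page: that iterating over a QueryResult yields Hit objects, and that each Hit has an `id` attribute plus an `accession` attribute when parsed from BLAST XML. The helper name `hits_to_rows` and the column names are made up for illustration.

```python
def hits_to_rows(qresults):
    """Flatten query results into one row per (query, hit) pair.

    Assumption: each item has an `id` attribute and yields hit objects
    when iterated, the way Bio.SearchIO QueryResult objects are expected to.
    `accession` is read with getattr because it may not be set for every
    input format; missing values become None.
    """
    rows = []
    for qresult in qresults:
        for hit in qresult:
            rows.append({
                "query_id": qresult.id,
                "hit_id": hit.id,
                "hit_accession": getattr(hit, "accession", None),
            })
    return rows


# Intended use with the XML file from my search (needs Biopython and pandas):
#   from Bio import SearchIO
#   import pandas as pd
#   rows = hits_to_rows(SearchIO.parse("my_blast.xml", "blast-xml"))
#   tsv_output = pd.DataFrame(rows, columns=["query_id", "hit_id", "hit_accession"])
#   tsv_output.to_csv("my_blast.tsv", sep="\t", index=False)
```

A query with no hits would produce no rows in this layout, which is fine for my purposes.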
Thank you