Question

Python SearchIO: Extracting information from QueryResults?

6.2 years ago

westin.kosater ▴ 80

I ran a BLAST search on a FASTA file containing multiple sequences using Python. I now want to extract information from the results into a pandas DataFrame: the query ID, the hit ID, and the accession number of the hit. Here's what I've done so far:

from Bio import SearchIO
from Bio.Blast import NCBIWWW
import pandas as pd

# Submit every sequence in the FASTA file to NCBI BLAST (blastx against human RefSeq proteins)
with open("list.fasta") as fasta_handle:
    fasta_string = fasta_handle.read()
result_handle = NCBIWWW.qblast("blastx", sequence=fasta_string, database="refseq_protein",
                               entrez_query="txid9606[ORGN]")

# Save the XML output so it can be re-parsed without repeating the search
with open("my_blast.xml", "w") as out_handle:
    out_handle.write(result_handle.read())
result_handle.close()

# Parse the saved results into a dict of {query ID: QueryResult}
qresults = SearchIO.parse("my_blast.xml", "blast-xml")
search_dict = SearchIO.to_dict(qresults)

query_id = []
hit_list = []
for key, value in search_dict.items():
    query_id.append(key)
    hit_list.append(value)

# Build the DataFrame after the loop so the query IDs are actually in it
tsv_output = pd.DataFrame({"query_id": query_id})

I have already collected the query IDs for the pandas DataFrame. Now I'm looking for a way to extract the ID of every hit from each entry in hit_list, which is a list of QueryResult objects. I've looked through the documentation (https://biopython.org/DIST/docs/api/Bio.SearchIO._model.query.QueryResult-class.html), but I don't see how to extract the hit ID or the hit accession number. Does anyone know how I could do this?

Thank you
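
For what it's worth, here is a rough sketch of one way this could be approached, not a verified answer. It relies on the fact that iterating over a QueryResult yields Hit objects, each with an `id` attribute, and that the blast-xml parser also sets an `accession` attribute on each hit. The function names `hits_to_rows` and `blast_xml_to_dataframe` are made up for illustration, and the sketch assumes one output row per (query, hit) pair is what is wanted.

```python
def hits_to_rows(qresults):
    """Flatten QueryResult objects into one dict per (query, hit) pair.

    `qresults` is any iterable of Bio.SearchIO QueryResult objects,
    e.g. SearchIO.parse(...) or the values of SearchIO.to_dict(...).
    """
    rows = []
    for qresult in qresults:
        # Iterating a QueryResult yields its Hit objects
        for hit in qresult:
            rows.append({
                "query_id": qresult.id,
                "hit_id": hit.id,
                # `accession` is a format-specific attribute, so fall back
                # to None if a parser did not set it
                "hit_accession": getattr(hit, "accession", None),
            })
    return rows


def blast_xml_to_dataframe(xml_path):
    """Hypothetical helper: parse a BLAST XML file into a pandas DataFrame."""
    # Imported here so hits_to_rows stays usable without these packages
    import pandas as pd
    from Bio import SearchIO

    rows = hits_to_rows(SearchIO.parse(xml_path, "blast-xml"))
    return pd.DataFrame(rows, columns=["query_id", "hit_id", "hit_accession"])


# Possible usage with the file written above (not executed here):
# df = blast_xml_to_dataframe("my_blast.xml")
# df.to_csv("hits.tsv", sep="\t", index=False)
```

Building a list of plain dicts first and creating the DataFrame once at the end avoids appending to a DataFrame row by row. Queries with no hits would simply contribute no rows under this layout.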

python blast xml • 1.8k views
