Greetings,
I am a first-time Python user, and I am stumped. I am using the following code to parse my BLAST XML file, and everything is working great. The one thing I can't figure out is the correct addition to the "OUT.write" line to extract the <Hit_accession> field. If anybody knows the correct object/attribute, or even better, a beginner-friendly list of the available objects, I would really appreciate it. No amount of googling has helped so far.
#!/usr/bin/env python
import sys
from Bio.Blast import NCBIXML
# Usage: opens an outfile, then parses any number of .xml files into that outfile, printing all hits
# parse_blastn.py outfile.txt anynumberofinfiles.xml
OUT = open(sys.argv[1], 'w')
# Header row; each data row below starts with its own newline
OUT.write("Query Name\tQuery Length\tAlignment Title\tAlignment ID\tAlignment Def\teValue")
for xml_file in sys.argv[2:]:
    result_handle = open(xml_file)
    blast_records = NCBIXML.parse(result_handle)
    for rec in blast_records:
        for alignment in rec.alignments:
            for hsp in alignment.hsps:
                OUT.write('\n' + str(rec.query_id) + '\t' + str(rec.query_length) + '\t' + str(alignment.title) + '\t' + str(alignment.hit_id) + '\t' + str(alignment.hit_def) + '\t' + str(hsp.expect))
    result_handle.close()
OUT.close()
Thanks again,
I'm sure you have your reasons, but it might have been simpler to ask BLAST+ to produce a tabular file with exactly those fields, something like this:
Anyway, could you post a snippet of the XML output and tell us which bit you are looking for when you say accession?
Here is one iteration of the XML output.
The field of interest is 11 lines down, marked with an (*): <Hit_accession>.
I agree that it would be a good deal easier to just run BLAST+ with -outfmt 6 or 7, but that doesn't feed into the rest of the pipeline I have been given to work with. Once I get a better handle on Python and Perl I will likely just change the later stages, but I'm starting with what at least looks to be the simpler piece first.
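For what it's worth, as far as I can tell the NCBIXML parser stores the <Hit_accession> element on the alignment object as `accession`, so adding `str(alignment.accession)` to the OUT.write line should give you that column; it is worth checking against your installed Biopython version. If you want to confirm what is actually in the file without Biopython, here is a minimal sketch using only the standard library. The XML fragment and the helper name `hit_accessions` are made up for illustration and are not taken from your output:

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed BLAST XML fragment (illustration only, not real output).
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>example_query</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|0000|ref|XX_000001.1|</Hit_id>
          <Hit_def>hypothetical example sequence</Hit_def>
          <Hit_accession>XX_000001</Hit_accession>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def hit_accessions(xml_text):
    """Return (query definition, hit accession) pairs found in the XML text."""
    root = ET.fromstring(xml_text)
    pairs = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):
            # Hit_accession is a direct child of each Hit element
            pairs.append((query, hit.findtext("Hit_accession")))
    return pairs


# Print one tab-separated line per hit, similar in spirit to the script above
for query, accession in hit_accessions(SAMPLE_XML):
    print(query + "\t" + accession)
```

For a real file you could pass `open(xml_file).read()` to the helper, though for large BLAST outputs the streaming NCBIXML.parse approach you already have is probably the better fit.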