Question

Extracting Accession Numbers With Ncbixml.Parse

0

Entering edit mode

10.1 years ago

jed.tressler • 0

Greetings;

I am a first time python user, and am stumped. I am using the following code to parse my blast XML file, and everything is working great. The one thing I cant figure out is the correct addition to the "OUT.write" line to extract the <hit_accession> field. If anybody knows the correct object/argument, or even better a beginning user friendly list of objects I would really appreciate it. No amount of googling has availed me so far.

#!/usr/bin/env python

import sys
from Bio.Blast import NCBIXML
#Usage, opens an outfile and then parses any number of .xml files into that outfile, printing all hits
#parse_blastn.py outfile.txt anynumberofinfiles.xml
OUT = open(sys.argv[1], 'w')

OUT.write("Query Name\tQuery Length\tAlignment Title\tAlignment ID\tAlignment Def\teValue")
for xml_file in sys.argv[2:]:
    result_handle = open(xml_file)
    blast_records = NCBIXML.parse(result_handle)
    for rec in blast_records:
        for alignment in rec.alignments:
                for hsp in alignment.hsps:
                    OUT.write('\n'+ str(rec.query_id) + '\t' + str(rec.query_length) + '\t' + str(alignment.title) + '\t' + str(alignment.hit_id) + '\t' + str(alignment.hit_def) + '\t' + str(hsp.expect))

Thanks again,

python biopython • 8.3k views

ADD COMMENT • link 10.1 years ago by jed.tressler • 0

0

Entering edit mode

I'm sure you have your reasons, but it might have been simpler to ask BLAST+ to produce a tabular file with exactly those fields, something like this:

blastn -outfmt "6 qseqid qlen sacc stitle etc" etc

Anyway, could you post a snippet of the XML output and tell us which bit you are looking for when you say accession?

ADD REPLY • link 10.1 years ago by Peter 6.0k

0

Entering edit mode

Here is one iteration of the XML output.

<Iteration>
      <Iteration_iter-num>3</Iteration_iter-num>
      <Iteration_query-ID>comp37547_c0_seq1</Iteration_query-ID>
      <Iteration_query-def>No definition line</Iteration_query-def>
      <Iteration_query-len>532</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|91084931|ref|XP_970864.1|</Hit_id>
          <Hit_def>PREDICTED: similar to AGAP007103-PA [Tribolium castaneum] &gt;gi|270009292|gb|EFA05740.1| hypothetical protein TcasGA2_TC015622 [Tribolium castaneum]</Hit_def>
          *<Hit_accession>XP_970864</Hit_accession>
          <Hit_len>953</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>170.244</Hsp_bit-score>
              <Hsp_score>430</Hsp_score>
              <Hsp_evalue>8.31222e-46</Hsp_evalue>
              <Hsp_query-from>58</Hsp_query-from>
              <Hsp_query-to>528</Hsp_query-to>
              <Hsp_hit-from>274</Hsp_hit-from>
              <Hsp_hit-to>422</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>0</Hsp_hit-frame>
              <Hsp_identity>80</Hsp_identity>
              <Hsp_positive>110</Hsp_positive>
              <Hsp_gaps>10</Hsp_gaps>
              <Hsp_align-len>158</Hsp_align-len>
              <Hsp_qseq>IPQRLEYVPNSGRHKICENAHLDLCDKPYNIEKVSVRMVLTTKHIGKGCDRDTYSISSQRKLCGASGDSIDLLPSPSV-ASWTSSLPTDDGNESDQIFAFDGQTNAVEVPEGHFNHTLTNHFTISTWMKHERHPKKDNSQAPSKSHKAPKEHILCMSD</Hsp_qseq>
              <Hsp_hseq>LPERVDYAPTTGKQELFPSASLELCDVPCNVRELRTRVDLATRHIGKGCDRDTYSVQSQRKLCGASPSAIDLLPSPGVGAEWTKPLISDEGHESDQMFEFDGSSTAVSIPNSVLDHNLPSTFTIATWMRHGQHNGQD---------KRVKEHILCSAD</Hsp_hseq>
              <Hsp_midline>+P+R++Y P +G+ ++  +A L+LCD P N+ ++  R+ L T+HIGKGCDRDTYS+ SQRKLCGAS  +IDLLPSP V A WT  L +D+G+ESDQ+F FDG + AV +P    +H L + FTI+TWM+H +H  +D         K  KEHILC +D</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>17754510</Statistics_db-num>
          <Statistics_db-len>6092763244</Statistics_db-len>
          <Statistics_hsp-len>131</Statistics_hsp-len>
          <Statistics_eff-space>173278431964</Statistics_eff-space>
          <Statistics_kappa>0.041</Statistics_kappa>
          <Statistics_lambda>0.267</Statistics_lambda>
          <Statistics_entropy>0.14</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>

The Field of interest is 11 lines down, marked with an (*), <hit_accession>.

I agree that it would be a good deal easier to just run Blast+ with Outfmt 6 or 7, but it doesn't feed into the rest of the pipeline I have been given to work with. Once I get a better handle on python and perl I will likely just change the latter stages, but I'm starting with what at least looks to be the simpler piece first.

ADD REPLY • link 10.1 years ago by jed.tressler • 0

score 6 · Answer 1 · 2014-03-13

If you are looking for the contents of the <Hit_accession> tag, you would want alignment.accession in the code example:

#!/usr/bin/env python

import sys
from Bio.Blast import NCBIXML
#Usage, opens an outfile and then parses any number of .xml files into that outfile, printing all hits
#parse_blastn.py outfile.txt anynumberofinfiles.xml

OUT = open(sys.argv[1], 'w')
OUT.write("Query Name\tQuery Length\tAlignment Title\tAlignment ID\tAlignment Def\teValue\n")
for xml_file in sys.argv[2:]:
    result_handle = open(xml_file)
    blast_records = NCBIXML.parse(result_handle)
    for rec in blast_records:
        for alignment in rec.alignments:
            for hsp in alignment.hsps:
                fields = [rec.query_id, str(rec.query_length), alignment.title, alignment.hit_id,
                          alignment.hit_def, alignment.accession, str(hsp.expect)]
                OUT.write("\t".join(fields) + "\n")
OUT.close()

Note I also fixed the problem where your original failed to include a trailing new line on the final line of output, and explicitly closed the output handle.

The Biopython Tutorial has/had some UML diagrams of the BLAST parser objects, but personally I never found them very friendly.