Question: Extracting Accession Numbers With Ncbixml.Parse
0
gravatar for jed.tressler
5.2 years ago by
jed.tressler0 wrote:

Greetings;

I am a first time python user, and am stumped. I am using the following code to parse my blast XML file, and everything is working great. The one thing I cant figure out is the correct addition to the "OUT.write" line to extract the <hit_accession> field. If anybody knows the correct object/argument, or even better a beginning user friendly list of objects I would really appreciate it. No amount of googling has availed me so far.

#!/usr/bin/env python

import sys
from Bio.Blast import NCBIXML
#Usage, opens an outfile and then parses any number of .xml files into that outfile, printing all hits
#parse_blastn.py outfile.txt anynumberofinfiles.xml
OUT = open(sys.argv[1], 'w')

OUT.write("Query Name\tQuery Length\tAlignment Title\tAlignment ID\tAlignment Def\teValue")
for xml_file in sys.argv[2:]:
    result_handle = open(xml_file)
    blast_records = NCBIXML.parse(result_handle)
    for rec in blast_records:
        for alignment in rec.alignments:
                for hsp in alignment.hsps:
                    OUT.write('\n'+ str(rec.query_id) + '\t' + str(rec.query_length) + '\t' + str(alignment.title) + '\t' + str(alignment.hit_id) + '\t' + str(alignment.hit_def) + '\t' + str(hsp.expect))

Thanks again,

python biopython • 4.3k views
ADD COMMENTlink modified 5.2 years ago • written 5.2 years ago by jed.tressler0

I'm sure you have your reasons, but it might have been simpler to ask BLAST+ to produce a tabular file with exactly those fields, something like this:

blastn -outfmt "6 qseqid qlen sacc stitle etc" etc

Anyway, could you post a snippet of the XML output and tell us which bit you are looking for when you say accession?

ADD REPLYlink modified 5.2 years ago • written 5.2 years ago by Peter5.8k

Here is one iteration of the XML output.

<Iteration>
      <Iteration_iter-num>3</Iteration_iter-num>
      <Iteration_query-ID>comp37547_c0_seq1</Iteration_query-ID>
      <Iteration_query-def>No definition line</Iteration_query-def>
      <Iteration_query-len>532</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|91084931|ref|XP_970864.1|</Hit_id>
          <Hit_def>PREDICTED: similar to AGAP007103-PA [Tribolium castaneum] &gt;gi|270009292|gb|EFA05740.1| hypothetical protein TcasGA2_TC015622 [Tribolium castaneum]</Hit_def>
          *<Hit_accession>XP_970864</Hit_accession>
          <Hit_len>953</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>170.244</Hsp_bit-score>
              <Hsp_score>430</Hsp_score>
              <Hsp_evalue>8.31222e-46</Hsp_evalue>
              <Hsp_query-from>58</Hsp_query-from>
              <Hsp_query-to>528</Hsp_query-to>
              <Hsp_hit-from>274</Hsp_hit-from>
              <Hsp_hit-to>422</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>0</Hsp_hit-frame>
              <Hsp_identity>80</Hsp_identity>
              <Hsp_positive>110</Hsp_positive>
              <Hsp_gaps>10</Hsp_gaps>
              <Hsp_align-len>158</Hsp_align-len>
              <Hsp_qseq>IPQRLEYVPNSGRHKICENAHLDLCDKPYNIEKVSVRMVLTTKHIGKGCDRDTYSISSQRKLCGASGDSIDLLPSPSV-ASWTSSLPTDDGNESDQIFAFDGQTNAVEVPEGHFNHTLTNHFTISTWMKHERHPKKDNSQAPSKSHKAPKEHILCMSD</Hsp_qseq>
              <Hsp_hseq>LPERVDYAPTTGKQELFPSASLELCDVPCNVRELRTRVDLATRHIGKGCDRDTYSVQSQRKLCGASPSAIDLLPSPGVGAEWTKPLISDEGHESDQMFEFDGSSTAVSIPNSVLDHNLPSTFTIATWMRHGQHNGQD---------KRVKEHILCSAD</Hsp_hseq>
              <Hsp_midline>+P+R++Y P +G+ ++  +A L+LCD P N+ ++  R+ L T+HIGKGCDRDTYS+ SQRKLCGAS  +IDLLPSP V A WT  L +D+G+ESDQ+F FDG + AV +P    +H L + FTI+TWM+H +H  +D         K  KEHILC +D</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>17754510</Statistics_db-num>
          <Statistics_db-len>6092763244</Statistics_db-len>
          <Statistics_hsp-len>131</Statistics_hsp-len>
          <Statistics_eff-space>173278431964</Statistics_eff-space>
          <Statistics_kappa>0.041</Statistics_kappa>
          <Statistics_lambda>0.267</Statistics_lambda>
          <Statistics_entropy>0.14</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>

The Field of interest is 11 lines down, marked with an (*), <hit_accession>.

I agree that it would be a good deal easier to just run Blast+ with Outfmt 6 or 7, but it doesn't feed into the rest of the pipeline I have been given to work with. Once I get a better handle on python and perl I will likely just change the latter stages, but I'm starting with what at least looks to be the simpler piece first.

ADD REPLYlink written 5.2 years ago by jed.tressler0
2
gravatar for Peter
5.2 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

If you are looking for the contents of the <Hit_accession> tag, you would want alignment.accession in the code example:

#!/usr/bin/env python

import sys
from Bio.Blast import NCBIXML
#Usage, opens an outfile and then parses any number of .xml files into that outfile, printing all hits
#parse_blastn.py outfile.txt anynumberofinfiles.xml

OUT = open(sys.argv[1], 'w')
OUT.write("Query Name\tQuery Length\tAlignment Title\tAlignment ID\tAlignment Def\teValue\n")
for xml_file in sys.argv[2:]:
    result_handle = open(xml_file)
    blast_records = NCBIXML.parse(result_handle)
    for rec in blast_records:
        for alignment in rec.alignments:
            for hsp in alignment.hsps:
                fields = [rec.query_id, str(rec.query_length), alignment.title, alignment.hit_id,
                          alignment.hit_def, alignment.accession, str(hsp.expect)]
                OUT.write("\t".join(fields) + "\n")
OUT.close()

Note I also fixed the problem where your original failed to include a trailing new line on the final line of output, and explicitly closed the output handle.

The Biopython Tutorial has/had some UML diagrams of the BLAST parser objects, but personally I never found them very friendly.

ADD COMMENTlink written 5.2 years ago by Peter5.8k

Awesome! Thanks so much.

ADD REPLYlink written 5.2 years ago by jed.tressler0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 672 users visited in the last hour