Extracting Accession Numbers With Ncbixml.Parse
1
0
Entering edit mode
10.1 years ago

Greetings;

I am a first time python user, and am stumped. I am using the following code to parse my blast XML file, and everything is working great. The one thing I cant figure out is the correct addition to the "OUT.write" line to extract the <hit_accession> field. If anybody knows the correct object/argument, or even better a beginning user friendly list of objects I would really appreciate it. No amount of googling has availed me so far.

#!/usr/bin/env python

import sys
from Bio.Blast import NCBIXML
#Usage, opens an outfile and then parses any number of .xml files into that outfile, printing all hits
#parse_blastn.py outfile.txt anynumberofinfiles.xml
OUT = open(sys.argv[1], 'w')

OUT.write("Query Name\tQuery Length\tAlignment Title\tAlignment ID\tAlignment Def\teValue")
for xml_file in sys.argv[2:]:
    result_handle = open(xml_file)
    blast_records = NCBIXML.parse(result_handle)
    for rec in blast_records:
        for alignment in rec.alignments:
                for hsp in alignment.hsps:
                    OUT.write('\n'+ str(rec.query_id) + '\t' + str(rec.query_length) + '\t' + str(alignment.title) + '\t' + str(alignment.hit_id) + '\t' + str(alignment.hit_def) + '\t' + str(hsp.expect))

Thanks again,

python biopython • 8.3k views
ADD COMMENT
0
Entering edit mode

I'm sure you have your reasons, but it might have been simpler to ask BLAST+ to produce a tabular file with exactly those fields, something like this:

blastn -outfmt "6 qseqid qlen sacc stitle etc" etc

Anyway, could you post a snippet of the XML output and tell us which bit you are looking for when you say accession?

ADD REPLY
0
Entering edit mode

Here is one iteration of the XML output.

<Iteration>
      <Iteration_iter-num>3</Iteration_iter-num>
      <Iteration_query-ID>comp37547_c0_seq1</Iteration_query-ID>
      <Iteration_query-def>No definition line</Iteration_query-def>
      <Iteration_query-len>532</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|91084931|ref|XP_970864.1|</Hit_id>
          <Hit_def>PREDICTED: similar to AGAP007103-PA [Tribolium castaneum] &gt;gi|270009292|gb|EFA05740.1| hypothetical protein TcasGA2_TC015622 [Tribolium castaneum]</Hit_def>
          *<Hit_accession>XP_970864</Hit_accession>
          <Hit_len>953</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>170.244</Hsp_bit-score>
              <Hsp_score>430</Hsp_score>
              <Hsp_evalue>8.31222e-46</Hsp_evalue>
              <Hsp_query-from>58</Hsp_query-from>
              <Hsp_query-to>528</Hsp_query-to>
              <Hsp_hit-from>274</Hsp_hit-from>
              <Hsp_hit-to>422</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>0</Hsp_hit-frame>
              <Hsp_identity>80</Hsp_identity>
              <Hsp_positive>110</Hsp_positive>
              <Hsp_gaps>10</Hsp_gaps>
              <Hsp_align-len>158</Hsp_align-len>
              <Hsp_qseq>IPQRLEYVPNSGRHKICENAHLDLCDKPYNIEKVSVRMVLTTKHIGKGCDRDTYSISSQRKLCGASGDSIDLLPSPSV-ASWTSSLPTDDGNESDQIFAFDGQTNAVEVPEGHFNHTLTNHFTISTWMKHERHPKKDNSQAPSKSHKAPKEHILCMSD</Hsp_qseq>
              <Hsp_hseq>LPERVDYAPTTGKQELFPSASLELCDVPCNVRELRTRVDLATRHIGKGCDRDTYSVQSQRKLCGASPSAIDLLPSPGVGAEWTKPLISDEGHESDQMFEFDGSSTAVSIPNSVLDHNLPSTFTIATWMRHGQHNGQD---------KRVKEHILCSAD</Hsp_hseq>
              <Hsp_midline>+P+R++Y P +G+ ++  +A L+LCD P N+ ++  R+ L T+HIGKGCDRDTYS+ SQRKLCGAS  +IDLLPSP V A WT  L +D+G+ESDQ+F FDG + AV +P    +H L + FTI+TWM+H +H  +D         K  KEHILC +D</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>17754510</Statistics_db-num>
          <Statistics_db-len>6092763244</Statistics_db-len>
          <Statistics_hsp-len>131</Statistics_hsp-len>
          <Statistics_eff-space>173278431964</Statistics_eff-space>
          <Statistics_kappa>0.041</Statistics_kappa>
          <Statistics_lambda>0.267</Statistics_lambda>
          <Statistics_entropy>0.14</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>

The Field of interest is 11 lines down, marked with an (*), <hit_accession>.

I agree that it would be a good deal easier to just run Blast+ with Outfmt 6 or 7, but it doesn't feed into the rest of the pipeline I have been given to work with. Once I get a better handle on python and perl I will likely just change the latter stages, but I'm starting with what at least looks to be the simpler piece first.

ADD REPLY
6
Entering edit mode
10.1 years ago
Peter 6.0k

If you are looking for the contents of the <Hit_accession> tag, you would want alignment.accession in the code example:

#!/usr/bin/env python

import sys
from Bio.Blast import NCBIXML
#Usage, opens an outfile and then parses any number of .xml files into that outfile, printing all hits
#parse_blastn.py outfile.txt anynumberofinfiles.xml

OUT = open(sys.argv[1], 'w')
OUT.write("Query Name\tQuery Length\tAlignment Title\tAlignment ID\tAlignment Def\teValue\n")
for xml_file in sys.argv[2:]:
    result_handle = open(xml_file)
    blast_records = NCBIXML.parse(result_handle)
    for rec in blast_records:
        for alignment in rec.alignments:
            for hsp in alignment.hsps:
                fields = [rec.query_id, str(rec.query_length), alignment.title, alignment.hit_id,
                          alignment.hit_def, alignment.accession, str(hsp.expect)]
                OUT.write("\t".join(fields) + "\n")
OUT.close()

Note I also fixed the problem where your original failed to include a trailing new line on the final line of output, and explicitly closed the output handle.

The Biopython Tutorial has/had some UML diagrams of the BLAST parser objects, but personally I never found them very friendly.

ADD COMMENT
0
Entering edit mode

Awesome! Thanks so much.

ADD REPLY

Login before adding your answer.

Traffic: 1782 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6