Question: Extracting the <Hit_def> from Blast xml output using Biopython and saving in .csv
3.9 years ago by
Anushka20
France
Anushka20 wrote:

Hello,
I have BLAST output in .xml form and I want to retrieve a few attributes such as <Hit_def>. I found the parser in Biopython.

CODE:
from Bio.Blast import NCBIXML

# Parse the BLAST XML and print the score and hit definition of every HSP
with open('output.xml') as handle:
    for record in NCBIXML.parse(handle):
        for align in record.alignments:
            for hsp in align.hsps:
                print(hsp.score, align.hit_def)

Q: The above code just prints the output to the terminal. Could anyone help me store the output in a .csv file?

Specifically, I need output.csv with the attributes <Iteration_query-def>, <Hit_def>, <Hsp_score>, and <Hsp_evalue> as columns.

Q2: How can I get the result for just the best hit of each query? Would setting -max_target_seqs to 1 when running blastp do the same?

The following is a segment of my input XML:

      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
      <Iteration_query-def>comp552019_c3_seq6_V2</Iteration_query-def>
      <Iteration_query-len>227</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|148727288|ref|NP_002327.2|</Hit_id>
          <Hit_def>low-density lipoprotein receptor-related protein 6 precursor [Homo sapiens] &gt;gi|578822872|ref|XP_006719141.1| PREDICTED: low-density lipoprotein receptor-related protein 6 isoform X1 [Homo sapiens]</Hit_def>
          <Hit_accession>NP_002327</Hit_accession>
          <Hit_len>1613</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>43.5133894476967</Hsp_bit-score>
              <Hsp_score>101</Hsp_score>
              <Hsp_evalue>0.000198686946331968</Hsp_evalue>
              <Hsp_query-from>43</Hsp_query-from>
              <Hsp_query-to>223</Hsp_query-to>
              <Hsp_hit-from>589</Hsp_hit-from>
              <Hsp_hit-to>767</Hsp_hit-to>
              <Hsp_query-frame>0</Hsp_query-frame>
              <Hsp_hit-frame>0</Hsp_hit-frame>
              <Hsp_identity>53</Hsp_identity>
              <Hsp_positive>79</Hsp_positive>
              <Hsp_gaps>24</Hsp_gaps>
              <Hsp_align-len>192</Hsp_align-len>
              <Hsp_qseq>TNEC--HDSKCEHICLARDAGGFVCKCSPGFTLVSGYK-CVSDSVTDDYILVADLGQKRLFQLPIRKST-----RNVGDLVAIDLDDVTDDRIYAASVIKKTGGLAWFDISAREIV--WGSKRLSRDDAVLSITTGCCNKKVYWTTQTGIYSWDGVSSTPDKLYSVSFFSDA-QIRQVVVDCKANLLYWIEY</Hsp_qseq>
              <Hsp_hseq>SNPCAEENGGCSHLCLYRPQG-LRCACPIGFELISDMKTCI---VPEAFLLFSRRADIRRISLETNNNNVAIPLTGVKEASALDFD-VTDNRIYWTDISLKTISRAFMNGSALEHVVEFGL------DYPEGMAVDWLGKNLYW-ADTGTNRIE-VSKLDGQHRQVLVWKDLDSPRALALDPAEGFMYWTEW</Hsp_hseq>
              <Hsp_midline>+N C   +  C H+CL R  G   C C  GF L+S  K C+   V + ++L +     R   L    +        V +  A+D D VTD+RIY   +  KT   A+ + SA E V  +G       D    +      K +YW   TG    + VS    +   V  + D    R + +D     +YW E+</Hsp_midline>
            </Hsp>
            <Hsp>
              <Hsp_num>2</Hsp_num>
              <Hsp_bit-score>39.6613936885231</Hsp_bit-score>
              <Hsp_score>91</Hsp_score>
              <Hsp_evalue>0.00402563881724524</Hsp_evalue>
              <Hsp_query-from>44</Hsp_query-from>
              <Hsp_query-to>128</Hsp_query-to>
              <Hsp_hit-from>891</Hsp_hit-from>
              <Hsp_hit-to>980</Hsp_hit-to>
              <Hsp_query-frame>0</Hsp_query-frame>
              <Hsp_hit-frame>0</Hsp_hit-frame>
              <Hsp_identity>34</Hsp_identity>
              <Hsp_positive>43</Hsp_positive>
              <Hsp_gaps>15</Hsp_gaps>
              <Hsp_align-len>95</Hsp_align-len>
              <Hsp_qseq>NECHDSK--CEHICLARDAGGFVCKCSPGFTLVSGYKCVSDSVTDDYI--------LVADLGQKRLFQLPIRKSTRNVGDLVAIDLDDVTDDRIY</Hsp_qseq>
              <Hsp_hseq>NECASSNGHCSHLCLAVPVGGFVCGCPAHYSLNADNRTCSAPTTFLLFSQKSAINRMVIDEQQSPDIILPIH-SLRNV---RAIDYDPL-DKQLY</Hsp_hseq>
              <Hsp_midline>NEC  S   C H+CLA   GGFVC C   ++L +  +  S   T            +V D  Q     LPI  S RNV    AID D + D ++Y</Hsp_midline>
            </Hsp>
          </Hit_hsps>

I would really appreciate your help.

Thanks

modified 3.9 years ago by RamRS • written 3.9 years ago by Anushka

Using xsltproc rather than Python would be straightforward.

written 3.9 years ago by Pierre Lindenbaum
3.9 years ago by
RamRS19k
Houston, TX
RamRS19k wrote:

You could redirect the output to a CSV file using file I/O. Open a file in write mode and modify the print statement so that it writes into the file. Here's one of many resources: http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python

Google away for more. This link should help you get the attributes you require.
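
As a rough sketch of the file-writing approach described above, something like the following could work. It uses Python's standard csv module instead of a hand-edited print, so commas inside <Hit_def> get quoted properly. The function names write_hsps_to_csv and blast_xml_to_csv and the column headers are made up for illustration. The attribute names (record.query, align.hit_def, hsp.score, hsp.expect) are the ones the NCBIXML parser is assumed to expose for <Iteration_query-def>, <Hit_def>, <Hsp_score>, and <Hsp_evalue>; check them against your Biopython version.

```python
import csv


def write_hsps_to_csv(records, out_handle):
    """Write one CSV row per HSP from an iterable of parsed BLAST records."""
    writer = csv.writer(out_handle)
    # Column names are illustrative, not part of the BLAST XML format
    writer.writerow(["query_def", "hit_def", "hsp_score", "hsp_evalue"])
    for record in records:
        for align in record.alignments:
            for hsp in align.hsps:
                writer.writerow([record.query, align.hit_def, hsp.score, hsp.expect])


def blast_xml_to_csv(xml_path="output.xml", csv_path="output.csv"):
    """Hypothetical wrapper: parse xml_path with Biopython and write csv_path."""
    from Bio.Blast import NCBIXML  # requires Biopython to be installed

    # newline="" is what the csv module recommends for files it writes to
    with open(xml_path) as xml_handle, open(csv_path, "w", newline="") as csv_handle:
        write_hsps_to_csv(NCBIXML.parse(xml_handle), csv_handle)
```

Calling blast_xml_to_csv() with the default arguments would read output.xml and write output.csv with one line per HSP.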

Q2: Best hit is an ambiguous term. Each hit can have multiple HSPs and you'd need to average or sum across HSP scores to find the "best" alignment.
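
To make the ambiguity concrete, here is a minimal sketch that picks one hit per query under a ranking rule you choose. The function name best_hit_rows and its combine parameter are hypothetical. It assumes records are the objects yielded by NCBIXML.parse, each with .query and .alignments, and that every alignment has at least one HSP with .score and .expect. Passing combine=max ranks hits by their single best HSP, while combine=sum ranks them by the total HSP score, and the two can disagree.

```python
def best_hit_rows(records, combine=max):
    """Yield (query, hit_def, score, evalue) for the top-ranked hit of each query.

    `combine` reduces a hit's HSP scores to one number used for ranking:
    max (default) uses the best HSP, sum uses the total over all HSPs.
    The reported score and e-value are always those of the hit's best HSP.
    """
    for record in records:
        if not record.alignments:
            continue  # this query had no hits
        best = max(record.alignments,
                   key=lambda align: combine(h.score for h in align.hsps))
        top_hsp = max(best.hsps, key=lambda h: h.score)
        yield record.query, best.hit_def, top_hsp.score, top_hsp.expect
```

The yielded tuples can be passed straight to csv.writer(...).writerows(...) if you want them in a .csv file.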

 
