Bio.SearchIO.HmmerIO only showing the best score domain
1
2
Entering edit mode
8.1 years ago

Hi everyone,

I am trying to parse a table file I obtained with hmmscan by comparing a set of fastas against the whole Pfam database of protein domains. I am trying to use HmmerIO in Python to parse the ouput, but it seems to me that, for each fasta, it only keeps the domain with the highest evalue, while I want to keep all of those with a significant evalue.

Example:

Input file:

    #                                                                             --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name                         accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ----------               -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
GST_N                PF02798.17 gi|528981796|ref|NC_021285.1|_5567 -            4.1e-16   58.9   0.0   6.9e-16   58.2   0.0   1.4   1   0   0   1   1   1   1 Glutathione S-transferase, N-terminal domain
GST_N_2              PF13409.3  gi|528981796|ref|NC_021285.1|_5567 -            3.3e-14   52.8   0.0   5.8e-14   52.0   0.0   1.4   1   0   0   1   1   1   1 Glutathione S-transferase, N-terminal domain
GST_N_3              PF13417.3  gi|528981796|ref|NC_021285.1|_5567 -            3.4e-14   52.8   0.0   5.4e-14   52.2   0.0   1.3   1   0   0   1   1   1   1 Glutathione S-transferase, N-terminal domain
GST_C                PF00043.22 gi|528981796|ref|NC_021285.1|_5567 -              7e-12   45.3   0.0   9.6e-12   44.8   0.0   1.2   1   0   0   1   1   1   1 Glutathione S-transferase, C-terminal domain
GST_C_2              PF13410.3  gi|528981796|ref|NC_021285.1|_5567 -            6.1e-07   29.2   0.4   9.3e-07   28.6   0.4   1.3   1   0   0   1   1   1   1 Glutathione S-transferase, C-terminal domain
GST_C_3              PF14497.3  gi|528981796|ref|NC_021285.1|_5567 -            0.00011   22.2   0.0   0.00017   21.6   0.0   1.2   1   0   0   1   1   1   1 Glutathione S-transferase, C-terminal domain

Here I found several statistically significant domains for the PF02798.17 gi|528981796|ref|NC_021285.1|_5567 query (fasta sequence), but the SearchIO output I have been able to get looks like this:

for qresult in SearchIO.parse(file1, 'hmmer3-tab'):
    print qresult.hits[0]

Query: gi|528981796|ref|NC_021285.1|_5567
       <unknown description>
  Hit: GST_N
       Glutathione S-transferase, N-terminal domain
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0   6.9e-16      58.20       ?      [None:None]            [None:None]

I have no idea of how to access the other hits, even if I am quite sure there has to be a way.

Anyone has any experience with this?

Thanks a lot.

python hmmer • 3.2k views
ADD COMMENT
5
Entering edit mode
8.1 years ago

I found the answer myself, it was a dumb coding mistake... If anyone ever needs it, the point is that every fasta sequence can have different hits and hits[0] will only show the first of the list.

The correct way to proceed is:

for qresult in SearchIO.parse(file1, 'hmmer3-tab'):
     for item in qresult.hits:
          print item
          print item.evalue # this will print the evalue of each domain hit
ADD COMMENT
0
Entering edit mode

Is there a way I can print it to a file as a tab delimited.

Query, target name, filtered(evalue), qlen ? ##filtered evalue is nothing but i would like output below a specific evalue

Thanks.

ADD REPLY

Login before adding your answer.

Traffic: 3825 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6