Question: Filtering HmmScan Domain Results
1
gravatar for hewaholax
2.2 years ago by
hewaholax10
hewaholax10 wrote:

Hi there,

Nowadays, I am trying to work on finding the protein domains by using hmmscan (HMMER 3.2.1). So, I have a file that multiple protein sequences exist and downloaded Pfam.A.hmm and this file. Then I followed the "User's Guide" of Hmmer which was released on here to install it properly.
Here are my steps after installation:

hmmbuild Pfam-A.hmm Pfam-A.seed
hmmpress Pfam-A.hmm 
hmmscan <my_sequence_file> Pfam-A.hmm

Could you inform me if I made mistake so far, please?

Then, I got these results:

enter image description here

Now, I couldn't find a proper values to filter them and configure the threshold. At this point, I tried some values as a threshold for both e-value and score but the results were not satisfactory.
Actually, I'm wondering that what kind of elimination was used on Hmmscan online service. Even though I read some pages such as Hmmer web services docs and here to filter and determine a threshold, I couldn't understand what I should do.

I will be waiting your replies.

Thanks in advance...

ADD COMMENTlink modified 17 months ago by Biostar ♦♦ 20 • written 2.2 years ago by hewaholax10
1

Have you looked at the output of the -h help for the tools?

You need to be more specific than 'not satisfactory' - what are you looking for? What thresholds need to be applied? Those hits look very good to me as far as I can tell?

hmmscan has 2 sections about customisable thresholds:

Options controlling reporting thresholds:
  -E <x>     : report models <= this E-value threshold in output  [10.0]  (x>0)
  -T <x>     : report models >= this score threshold in output
  --domE <x> : report domains <= this E-value threshold in output  [10.0]  (x>0)
  --domT <x> : report domains >= this score cutoff in output

Options controlling inclusion (significance) thresholds:
  --incE <x>    : consider models <= this E-value threshold as significant
  --incT <x>    : consider models >= this score threshold as significant
  --incdomE <x> : consider domains <= this E-value threshold as significant
  --incdomT <x> : consider domains >= this score threshold as significant
ADD REPLYlink written 2.2 years ago by Joe18k
5
gravatar for a.zielezinski
2.2 years ago by
a.zielezinski9.2k
a.zielezinski9.2k wrote:

The perl script, pfam_scan.pl (from EBI) was developed for this exact purpose.

pfam_scan.pl searches one or more sequences for matching Pfam domains using standard hmmscan. In addition, the program groups together families which have a common evolutionary ancestor in clans. Where there are overlapping matches within a clan, pfam_scan.pl will only show the most significant (the lowest e-value) match within the clan. The same clan filtering step is performed on the Pfam website (https://pfam.xfam.org). As a result, you get a list of non-overlapping and most significant protein domains for each of your query sequence.

ADD COMMENTlink written 2.2 years ago by a.zielezinski9.2k

First of all, thanks for your clear response! I will try to use it. If I face a problem, I'll ask again. Thanks again ...

ADD REPLYlink written 2.2 years ago by hewaholax10
1
gravatar for h.mon
2.2 years ago by
h.mon31k
Brazil
h.mon31k wrote:

It is difficult to find "proper" values to filter, because they are protein domain-dependent and taxon-dependent. See, for example, this discussion from dbCAN:

** About what E-value and Coverage cutoff thresholds you should use (in order to further parse yourfile.out.dm.ps file), we have done some evaluation analyses using arabidopsis, rice, Aspergillus nidulans FGSC A4, Saccharomyces cerevisiae S288c and Escherichia coli K-12 MG1655, Clostridium thermocellum ATCC 27405 and Anaerocellum thermophilum DSM 6725. Our suggestion is that for plants, use E-value < 1e-23 and coverage > 0.2; for bacteria, use E-value < 1e-18 and coverage > 0.35; and for fungi, use E-value < 1e-17 and coverage > 0.45.

** We have also performed evaluation for the five CAZyme classes separately, which suggests that the best threshold varies for different CAZyme classes (please see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4132414/ for details). Basically to annotate GH proteins, one should use a very relax coverage cutoff or the sensitivity will be low (Supplementary Tables S4 and S9); (ii) to annotate CE families a very stringent E-value cutoff and coverage cutoff should be used; otherwise the precision will be very low due to a very high false positive rate (Supplementary Tables S5 and S10)

Which means you will have to find values from the literature, or explore yourself until you find values appropriate for you purposes.

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by h.mon31k

I got it and also thanks for your reply. I'll check the literature.

ADD REPLYlink written 2.2 years ago by hewaholax10

Hi, I'm working on a plant genome (close to Arabidopsis) the value must be E-value < 1e-23 and coverage > 0.2 But I'm a bit confused with the options of hmmscan as described here by @jrj.healey
what should be the values and the combination of these options for hmmscan : -E, and/or -T and/or --domE....--incdomT ?

ADD REPLYlink modified 21 months ago • written 21 months ago by alexis.groppi30

After some test, my command line :

hmmscan --cpu 20 --domtblout PFAM_filtered.out -E 1e-23 --domE 0.2 Pfam-A.hmm file.fasta > pfam_filtered.log

Works perfectly

ADD REPLYlink modified 21 months ago • written 21 months ago by alexis.groppi30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 825 users visited in the last hour