Question: How to choose between --cut_ga, --cut_tc, and --cut_nc in hmmsearch? (hmmer)
O.rka wrote:

I'm trying to decide which would be the most appropriate threshold criterion. In particular, this is for Pfam HMMs and a human metagenomics dataset. I'm curious what people have done in the past? It looks like anvi'o uses --cut_ga as the default: https://github.com/merenlab/anvio/issues/498

What has been generally accepted as the most appropriate for metagenomics?

   --cut_ga
          Use Pfam GA (gathering threshold) score cutoffs.  Equivalent  to
          --globT <GA1> --domT <GA2>, but the GA1 and GA2 cutoffs are read
          from the HMM file. hmmbuild puts  these  cutoffs  there  if  the
          alignment file was annotated in a Pfam-friendly alignment format
          (extended SELEX or Stockholm format) and the optional GA annota-
          tion  line  was present. If these cutoffs are not set in the HMM
          file, --cut_ga doesn't work.


   --cut_tc
          Use Pfam  TC  (trusted  cutoff)  score  cutoffs.  Equivalent  to
          --globT <TC1> --domT <TC2>, but the TC1 and TC2 cutoffs are read
          from the HMM file. hmmbuild puts  these  cutoffs  there  if  the
          alignment file was annotated in a Pfam-friendly alignment format
          (extended SELEX or Stockholm format) and the optional TC annota-
          tion  line  was present. If these cutoffs are not set in the HMM
          file, --cut_tc doesn't work.


   --cut_nc
          Use Pfam NC (noise cutoff) score cutoffs. Equivalent to  --globT
          <NC1>  --domT  <NC2>,  but the NC1 and NC2 cutoffs are read from
          the HMM file. hmmbuild puts these cutoffs there if the alignment
          file was annotated in a Pfam-friendly alignment format (extended
          SELEX or Stockholm format) and the optional NC  annotation  line
          was  present.  If  these  cutoffs  are  not set in the HMM file,
          --cut_nc doesn't work.

http://www.cbs.dtu.dk/cgi-bin/nph-runsafe?man=hmmsearch
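In other words, each of these flags just makes hmmsearch require that both the full-sequence and the per-domain bit-scores clear the two cutoffs stored in the HMM. The same effect can be approximated after the fact by post-filtering a `--domtblout` table; here is a minimal sketch, assuming the standard HMMER3 domtblout column layout (full-sequence bit-score in column 8, per-domain bit-score in column 14) and a made-up table row:

```python
# Post-filter hmmsearch --domtblout rows by Pfam GA-style cutoffs,
# mimicking what --cut_ga does at search time. Column indices assume
# the HMMER3 domtblout layout (full-seq score = col 8, domain score = col 14).

def passes_ga(domtblout_line, ga_seq, ga_dom):
    f = domtblout_line.split()
    full_seq_score = float(f[7])   # bit-score over the whole sequence (GA1)
    domain_score = float(f[13])    # bit-score of this domain alone (GA2)
    return full_seq_score >= ga_seq and domain_score >= ga_dom

# Hypothetical domtblout row (whitespace-delimited, description omitted):
row = ("seq1 - 300 PF10417.9 PF10417.9 90 1e-12 34.2 0.1 "
       "1 1 2e-12 1e-11 31.5 0.1 5 88 10 95 8 99 0.95")
print(passes_ga(row, ga_seq=25.0, ga_dom=25.0))   # True
print(passes_ga(row, ga_seq=25.0, ga_dom=40.0))   # False (domain score 31.5 < 40)
```

Running the search with --cut_ga directly is preferable in practice, since it also tightens the filter pipeline; the sketch only illustrates the two-score semantics.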

Tags: protein alignment hmm domain
Mensur Dlakic wrote:

As explained above, none of these options will matter unless your HMM has the cutoffs set. The appropriate lines in HMMs look like this:

GA    25.00 25.00;
TC    25.00 25.00;
NC    24.90 24.90;
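These lines live in the header of the HMM text file, so it is easy to check whether a given model carries them at all (and thus whether --cut_ga/--cut_tc/--cut_nc can work). A minimal sketch, assuming the HMMER3 text format where each cutoff line looks like the examples above:

```python
# Parse GA/TC/NC cutoff lines from the header of an HMM text file.
# A sketch assuming the HMMER3 text format ("GA    25.00 25.00;" etc.).

def parse_cutoffs(hmm_text):
    cutoffs = {}
    for line in hmm_text.splitlines():
        fields = line.split()
        if fields and fields[0] in ("GA", "TC", "NC"):
            seq_cut = float(fields[1].rstrip(";"))   # full-sequence cutoff
            dom_cut = float(fields[2].rstrip(";"))   # per-domain cutoff
            cutoffs[fields[0]] = (seq_cut, dom_cut)
    return cutoffs

header = """\
GA    25.00 25.00;
TC    25.00 25.00;
NC    24.90 24.90;
"""
print(parse_cutoffs(header))
# {'GA': (25.0, 25.0), 'TC': (25.0, 25.0), 'NC': (24.9, 24.9)}
```

If `parse_cutoffs` returns an empty dict for a model, the corresponding --cut_* flag will fail for it.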

These bit-scores have to be set manually during model building, and their main purpose is to catch borderline matches that may be true hits but have statistically insignificant E-values. On rare occasions, the use of these cutoffs will eliminate a match that has a barely significant E-value. A simple way of setting these scores is to pick the worst score in a known group of trusted family members.
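That "worst score among trusted members" rule is literally a minimum. A toy sketch with made-up bit-scores:

```python
# "Pick the worst score in a known group of trusted family members":
# set GA at the minimum bit-score among curated true positives.
# Scores below are invented for illustration.

trusted_scores = [61.3, 48.7, 33.2, 25.0, 29.9]
ga = min(trusted_scores)
print(f"GA    {ga:.2f} {ga:.2f};")   # GA    25.00 25.00;
```

In real curation one would also look at the best-scoring known non-member, which is where the NC (noise) cutoff comes from.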

Most of the time you will get the same result from these scores as from E-values, and I use the latter as a guide. It is good practice to manually inspect hits that are just above or just below the E-value threshold, and to set the database size (-Z switch) to a fixed number so that searches done over a long period of time and against different databases can be compared. If you check how Pfam does it, you will notice that they are not using the cutoffs even though their HMMs support them. Instead, they set the database size to 45638612.


I noticed a lot of wrappers use the -Z flag. How exactly would this influence the results?

-- O.rka

To answer your question, we first need to note that the E-value is the product of the number of tests and the P-value. For biological sequence comparisons, the number of tests (N) is equivalent to the number of database sequences.

Let's say you search an HMM against a database with 1 million sequences and find a match to SSDZ_HUMAN with E=1e-20. If you search the same HMM against a database with 10 million sequences that contains SSDZ_HUMAN -- or you perform the same search 3 years later after the database has grown 10x -- this sequence will now have E=1e-19. The -Z switch specifies the N that will be used for the E-value calculation, regardless of the actual number of database sequences.

A consequence of constant sequence growth is that E-values for the same match get less significant. That obviously matters little if our match had E=1e-20 to begin with. However, if our match has E=0.01 and the database grows 5x, the next time we search its E-value will be statistically insignificant.
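The arithmetic behind both examples above is just E = N * P with illustrative P-values:

```python
# E-value = N * P-value: the number of database comparisons (N) scales
# the per-comparison P-value into an expected number of chance hits.
# P-values below are invented to reproduce the numbers in the discussion.

def evalue(pvalue, n_seqs):
    return pvalue * n_seqs

p_strong = 1e-26                          # a strong match
e_small_db = evalue(p_strong, 1_000_000)  # ~1e-20 against 1M sequences
e_big_db = evalue(p_strong, 10_000_000)   # ~1e-19 after the database grows 10x

# A borderline hit is hit harder: E ~0.01 at N=1M becomes E ~0.05 at N=5M,
# past a typical significance threshold.
e_before = evalue(1e-8, 1_000_000)
e_after = evalue(1e-8, 5_000_000)
print(e_small_db, e_big_db, e_before, e_after)
```

Fixing N via -Z freezes this multiplier, which is what makes E-values comparable across searches.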

-- Mensur Dlakic

Thank you, this is extremely helpful. When you say database size, do you mean the number of HMMs in an HMM database or the number of sequences used to create the database?

Should I use the -Z value that was used to build each Pfam entry for each hmmsearch run? For example, for PF10417.9 there is this SM field in the combined Pfam database:

    SM    hmmsearch -Z 45638612 -E 1000 --cpu 4 HMM pfamseq

Are these the recommended settings when running against Pfam? If so, do I need to adjust this manually for each run?

-- O.rka

N is always the number of sequences/HMMs in the database that is being searched. If you are using hmmsearch, that will be the number of individual proteins in your FASTA database, unless you specify differently via the -Z switch. If you are using hmmpfam or hmmscan, N will be the number of HMMs in your target database unless overridden via -Z.

There is no right or wrong value for -Z -- it is a matter of maintaining the consistency of scores/E-values over a period of time. I don't know exactly why the Pfam folks chose -Z 45638612, but it is a safe bet that at one point that was the size of their FASTA database and they decided to stick with it. I don't know what -Z you should pick, but 45638612 is as good a choice as any. For a protein sequence database, any -Z number in the tens of millions should be a good approximation of the current database size. When searching against a database of HMMs/profiles, any -Z in the tens of thousands should be fine. Again, it is a matter of picking a number that will ensure internal consistency of scoring rather than trying to find an optimal -Z value.
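Because E = N * P, an E-value reported against one database size can be re-expressed at a fixed -Z with a single multiplication. A sketch of that arithmetic (not a replacement for passing -Z to hmmsearch itself, which is the clean way to do it):

```python
# Rescale a reported E-value to a fixed effective database size.
# Relies only on E-value = N * P-value, so E scales linearly with N.

Z_FIXED = 45_638_612  # the fixed N Pfam uses in its SM lines

def rescale_evalue(e_reported, n_actual, z_fixed=Z_FIXED):
    """E-value the same hit would have had with -Z z_fixed."""
    return e_reported * z_fixed / n_actual

# A hit with E=1e-5 from a 10M-sequence search, expressed at Pfam's Z:
print(rescale_evalue(1e-5, 10_000_000))  # ~4.56e-05
```

This only works for comparing runs after the fact; setting -Z up front also makes the reported per-hit E-values directly comparable in the output files.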

-- Mensur Dlakic

Ok, interesting. It might be a good idea for me to start using -Z 45638612 to keep analysis pipelines fairly consistent. My last question: is the bit-score affected by -Z, or only the E-value calculation?

-- O.rka

Bit-scores are not affected by -Z; only E-values are.

-- Mensur Dlakic