Question

hmmsearch vs hmmscan - e-value, speed, & output differences?

3.9 years ago

jabaron.phd • 0

I want to identify protein domains in predicted ORFs from a de novo metatranscriptome assembly. hmmsearch is computationally faster than hmmscan (see extra info below), so I would like to use it to compare the HMM profiles in the Pfam database against my predicted peptide database. If I set the -Z option to the number of HMMs in Pfam, will the E-values and output of hmmsearch be identical to those of hmmscan run on the same files (without the -Z option)?

In comparing hmmscan and hmmsearch, the authors of HMMER point out in the blog post "hmmscan vs. hmmsearch speed: the numerology":

hmmscan and hmmsearch are doing exactly the same compute, at heart: comparing one profile to one sequence at a time. Their bit score results are identical. You can save hmmsearch tabular output files and use ’em just the same way you were going to use the hmmscan files.

They also point out that hmmsearch is faster because both programs are input-bound and hmmsearch loads less data, but they include this caveat in the post:

(Um, watch out for E-values: remember that E-values depend on the size of the database you search.)

I know that E-values depend on database size. My understanding is that only the target database size influences the E-value, and that the target database is the HMM file (Pfam in my case) for hmmscan and the sequence file for hmmsearch.
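
To make the dependence concrete: as far as I understand the HMMER documentation, a per-sequence E-value is the P-value multiplied by Z, the number of targets searched (or the value given with -Z). Under that assumption, an E-value computed for one Z can be converted to another by a simple ratio. The sketch below is only illustrative; the function name and the numbers are made up, and E-values read from tabular output are rounded, so the conversion is approximate.

```python
def rescale_evalue(evalue, z_old, z_new):
    """Convert a per-sequence E-value from search-space size z_old to z_new.

    Assumes E = P * Z, so the same P-value gives E_new = E_old * z_new / z_old.
    """
    if z_old <= 0 or z_new <= 0:
        raise ValueError("Z must be positive")
    return evalue * z_new / z_old


# Hypothetical sizes, for illustration only.
n_profiles = 18_000      # number of HMMs (what hmmscan would use as Z by default)
n_sequences = 250_000    # number of peptides (what hmmsearch would use as Z by default)

e_hmmsearch = 1e-5       # an E-value reported by hmmsearch with default Z
e_as_hmmscan = rescale_evalue(e_hmmsearch, n_sequences, n_profiles)
print(f"hmmsearch E={e_hmmsearch:.2g} corresponds to hmmscan E={e_as_hmmscan:.2g}")
```

Bit scores do not change under this conversion; only the E-value scales with Z.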

hmmer protein hmmsearch E-value hmmscan • 6.1k views

score 2 · Accepted Answer · 2020-05-14

The answer to your question is yes, but I wouldn't do it the way you described. At the very least, I would use the larger of the two database sizes: if your sequence database has more than ~18,000 entries (roughly the size of Pfam), I would set -Z to the size of the sequence database rather than that of the Pfam database. To make your life easier, consider passing the same -Z value to both commands. A separate issue is how -Z should be set to keep scoring internally consistent over time. There is a discussion about that in this thread.
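
A minimal sketch of that suggestion, in Python: count the entries on both sides, take the larger count as Z, and pass the same -Z to both programs. The helper names and file names here are hypothetical; the only HMMER options used are -Z and --tblout. It assumes a FASTA sequence file (one ">" header per record) and an HMMER3 text-format profile file (each model terminated by a "//" line). Note that hmmscan needs the profile database to be prepared with hmmpress first.

```python
import shlex


def count_fasta_records(lines):
    """Count FASTA records: one header line starting with '>' per sequence."""
    return sum(1 for line in lines if line.startswith(">"))


def count_hmm_profiles(lines):
    """Count models in an HMMER3 text profile file: each ends with a '//' line."""
    return sum(1 for line in lines if line.strip() == "//")


def choose_z(n_profiles, n_sequences):
    """Use the larger of the two database sizes, as suggested above."""
    return max(n_profiles, n_sequences)


def build_commands(hmm_db, seq_file, z, prefix="out"):
    """Build argv lists for hmmsearch and hmmscan that share the same -Z."""
    z_arg = str(z)
    search = ["hmmsearch", "-Z", z_arg, "--tblout", f"{prefix}.hmmsearch.tbl",
              hmm_db, seq_file]
    scan = ["hmmscan", "-Z", z_arg, "--tblout", f"{prefix}.hmmscan.tbl",
            hmm_db, seq_file]
    return search, scan


# Tiny in-memory stand-ins for real files (hypothetical content).
fasta_lines = [">orf1\n", "MKT\n", ">orf2\n", "MAL\n", ">orf3\n", "MSS\n"]
hmm_lines = ["HMMER3/f\n", "NAME  domA\n", "//\n", "HMMER3/f\n", "NAME  domB\n", "//\n"]

z = choose_z(count_hmm_profiles(hmm_lines), count_fasta_records(fasta_lines))
for argv in build_commands("Pfam-A.hmm", "orfs.faa", z):
    print(shlex.join(argv))
```

With real data, pass open file handles instead of the lists, e.g. `with open("orfs.faa") as fh: count_fasta_records(fh)`, and run the printed commands in a shell or via `subprocess.run`.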