Optimizing HMMER Speed/Performance
13.3 years ago
Rvosa ▴ 580

I have only recently started to experiment with HMMER (3.0, http://hmmer.org/) and I find it to be somewhat slow, but this is probably because I'm doing everything very naively. I've identified a number of potential issues, hoping to get some feedback on which avenues are worth pursuing.

  1. Most importantly, I've had issues compiling the optimized code. I'm building it on a PPC cluster that, if I'm not mistaken, has special instructions which HMMER should be able to take advantage of. I tried to pass --enable-altivec and --enable-vmx to the configure script, but it keeps telling me it will go with "dummy" (unoptimized) code anyway. If I don't pass anything and I use gcc as my compiler, it pretends to pick everything up correctly, but then the "make check" target gives many errors. Specifically setting --enable-dummy yields a working executable, but with very low performance. If anyone has experience building HMMER, I would love to pick your brains to get this to work correctly.

  2. I'm trying to run jackhmmer against a database of nine mammalian genomes. In dummy mode, I got it to return reasonable results (but slowly). However, my database is simply a massive FASTA file, and I find this surprising. Is there some sort of equivalent to "formatdb" for HMMER that I'm not aware of?

  3. Somewhat unrelated, but the default E value for jackhmmer (10) seems incredibly "lenient". Shouldn't this be a number that's many orders of magnitude smaller?

Thanks!

P.S. My first question. Hope I phrased it clearly and with the correct tagging and formatting.

hmmer • 4.8k views

What is the exact model of PPC? AltiVec/VMX is not available on all models. On Linux, you can check availability in /proc/cpuinfo. On CPUs that support SSE3, enabling SIMD makes Smith-Waterman tens of times faster.
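
As a rough illustration of the /proc/cpuinfo check, here is a minimal Python sketch. The function names are made up for this example, and it assumes that a PowerPC kernel reports AltiVec in the "cpu" line (for example "altivec supported") and that x86 kernels list "sse2" among the flags; verify against the actual output on your machine.

```python
import re


def simd_features(cpuinfo_text, wanted=("altivec", "sse2")):
    """Return which of the wanted keywords appear as whole tokens in the text.

    Assumption: the kernel mentions the feature by one of these names.
    """
    tokens = set(re.findall(r"[a-z0-9_]+", cpuinfo_text.lower()))
    return sorted(name for name in wanted if name in tokens)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file, returning an empty string if it is unavailable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    found = simd_features(read_cpuinfo())
    print("SIMD keywords found:", found if found else "none")
```

An empty result suggests that only the "dummy" build will work on that node, although it is worth reading the file by eye as well, since the exact wording differs between kernels.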

Good point. Turns out we have Power5 chips, which don't have AltiVec, unfortunately.

13.3 years ago

Hi rvosa,

Welcome to the fantastic world of bioinformatics Q&A!

First of all, HMMER will run optimized on PPC with at least Power6 cores. These have been available in servers since 2007, so you will need to check what you have.

To your second point, HMMER comes with a utility called hmmpress that compresses your database and converts it into binary files. It's not like formatdb, but it saves some time, memory, and space.

The third point is more delicate. The default parameters in HMMER are terrible; they only work if you want some sort of primary filter. E-values are sensitive to your database size and somewhat tricky to calibrate. If you know what you want, consider using bit scores instead of E-values. Recently, I was annotating selenoproteins in Naegleria gruberi with HMMER3, and my E-value cutoff for domains was 1e-04, which is quite restrictive. So it's up to you to set thresholds for hit and domain reporting. What do you need?
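
To make the database-size dependence concrete: an E-value is the per-comparison P-value multiplied by the number of targets searched, so the same hit looks less significant against a larger database. The Python sketch below only illustrates that arithmetic; the function names and the example numbers are invented for this post.

```python
def evalue(p_value, n_targets):
    """Expected number of false hits at this score: P-value times database size."""
    return p_value * n_targets


def max_pvalue(evalue_cutoff, n_targets):
    """Largest per-sequence P-value that still passes a given E-value cutoff."""
    return evalue_cutoff / n_targets


if __name__ == "__main__":
    p = 1e-7  # hypothetical P-value of one hit
    for n in (1_000, 100_000, 10_000_000):
        print(f"targets={n:>10,}  E-value={evalue(p, n):.3g}")
```

A bit score, by contrast, does not change when the database grows, which is one reason to prefer bit-score thresholds when results from databases of different sizes need to be compared.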

I looked into it, and it turns out that we have PPC Power5 cores, which don't have AltiVec instructions. That means that I can only run the "dummy" code on our cluster, which is pretty much a non-starter, unfortunately.

About the hmmpress utility, I probably just don't understand how everything is supposed to fit together, because it seems to want Stockholm alignments, whereas my "database" is a large file of unaligned FASTA sequences.

About E-values versus bit scores, I'm doing some background reading about how they relate to database size and query sequence length.

Just a tip: the HMMER3 user guide does not cover all the options of its utilities. By now it's possible to use Stockholm and FASTA files with most of them. hmmpress is meant to be used on HMMs. It's useless if you just want to find a given protein in a genome, but it's quite useful if you intend to run the entire Pfam against your target. formatdb works backwards. Besides all that, Eddy and pals don't explain how HMMER models work. The only source of information is their book, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.
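
To show how the pieces fit together, here is a hedged Python sketch that only builds the command lines for the two workflows discussed in this thread: pressing an HMM database and scanning sequences against it, and running jackhmmer directly on an unformatted FASTA database. All file names are placeholders and the helper functions are invented for this example; check the options against `-h` for your installed HMMER version before running anything.

```python
import shlex


def pfam_scan_commands(hmm_db, query_fasta, domtbl_out, dom_evalue=1e-4):
    """Commands to press an HMM database, then scan sequences against it."""
    press = ["hmmpress", hmm_db]  # hmmpress takes a file of HMMs, not sequences
    scan = [
        "hmmscan",
        "--domE", str(dom_evalue),   # per-domain E-value reporting threshold
        "--domtblout", domtbl_out,   # per-domain table output file
        hmm_db,
        query_fasta,
    ]
    return press, scan


def jackhmmer_command(query_fasta, target_fasta, evalue=1e-4):
    """jackhmmer reads the target FASTA file directly; no formatdb-like step."""
    return ["jackhmmer", "-E", str(evalue), query_fasta, target_fasta]


def show(command):
    """Render a command list as a shell-quoted string for inspection."""
    return " ".join(shlex.quote(part) for part in command)


if __name__ == "__main__":
    press, scan = pfam_scan_commands("Pfam-A.hmm", "proteins.fasta", "domains.tbl")
    for command in (press, scan, jackhmmer_command("query.fasta", "genomes.fasta")):
        print(show(command))
```

The lists can be passed to `subprocess.run` once the paths are real. Building them separately first makes it easy to inspect the thresholds before committing to a long run.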
