Question

Use HHblits to find homologs in multiple species?

4.0 years ago

khhgng ▴ 70

Hello !

For a protein domain of interest (~50 aa), I have close to 1000 seed sequences from Pfam. I want to search for members of this multigene family (i.e., proteins containing this domain) in different species.

Many proteins of this family have been identified in the past using HMMER and BLAST. Can hhsuite3 improve on those results by identifying more remote homologs?
If yes, from what I could infer, for hhblits/hhsearch I would need a custom proteome database for each species, which I would then search with an MSA of these seed sequences.

Is that the right way, or would something else be more appropriate?

PS. I am low on computational resources !!

Many thanks

alignment hhsuite hmmer hhblits proteome • 2.8k views

score 1 · Answer 1 · 2020-05-01

The problem you've described in your post is similar to one of my research problems from a decade back, and I did try to compare results for protein domain discovery using HMMER2 vs HMMER3 vs HHBLITS. I will try to provide some summary points that might help you. Feel free to ask any follow-up questions.

1. As you are probably aware, HHBLITS uses profile-profile comparison to detect remote matches and is the most sensitive of the three. The deprecated HMMER2 and the current HMMER3 employ profile-sequence comparison, which is less sensitive but good enough, and widely used for domain discovery and annotation of protein sequences. BLAST, on the other hand, uses sequence-sequence heuristics to report matches and is the least sensitive. BLAST is usually not used for protein domain discovery or annotation unless the domain is highly conserved.

I suspect your protein domain is NOT highly conserved, or you would not be thinking about HHBLITS, am I right?

2. When sequence identity falls below 30%, the rate of false positives increases. Below 20%, it becomes very hard to distinguish true positives from false ones. So when you want to detect remote homologs, this is something you need to be careful about. Usually, higher sensitivity comes at the cost of lower specificity. As a workaround, the context-specific alphabet of 219 letters bumps up the specificity; in fact, CS-BLAST is demonstrably more specific than regular BLAST. So you can reasonably claim that HHBLITS is both sensitive (few false negatives) and reasonably specific (few false positives).

3. The computational requirements to run HHBLITS are significant, and you do not describe what resources you have at your disposal. You'd need a machine with multiple CPUs (the more, the quicker) and plenty of RAM (64 GB for some steps), very different from what you'd find on a typical desktop. I used a High Performance Computing Center facility at my university for my HHBLITS runs.
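To make the identity thresholds in point 2 concrete, here is a minimal Python sketch that computes percent identity for two already-aligned sequences and labels the result using the 20% and 30% cut-offs mentioned above. The function names are made up for illustration, and the denominator (columns where both sequences have a residue) is an assumption; other definitions of percent identity exist and give different numbers.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where both sequences have a residue.

    Assumes the two strings come from the same alignment (equal length),
    with '-' or '.' marking gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a in "-." or b in "-.":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_zone(pid: float) -> str:
    """Label a percent identity using the rough 20% / 30% cut-offs from point 2."""
    if pid < 20.0:
        return "very hard to separate true from false positives"
    if pid < 30.0:
        return "elevated false-positive rate"
    return "comparatively safe"


if __name__ == "__main__":
    pid = percent_identity("ACDE-F", "ACDFGF")
    print(pid, identity_zone(pid))
```

Running this over pairs drawn from your seed alignment gives a quick feel for how much of the family sits below 30% identity.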

In case you did not already know this, here is a very simplified summary of how the HHBLITS pipeline works (or used to work):

you'd need to download the entire annotated proteome for your species of interest,
find homologs of each of those sequences in a non-redundant (very large) database of proteins,
align those, thereby converting each query sequence => alignment => profile based not on the 20 aa residues (like in HMMER), but on a CS219 alphabet,
download one or more HHBLITS profiles from Pfam version 32 or whichever is the latest, pre-made by the HHSUITE research group,
scan your database of protein profiles (for your proteomes) vs. one HHBLITS Pfam profile of interest, OR
scan your database of HHBLITS Pfam profiles vs. one CS219 profile for one protein from your proteome.

You would need to repeat this process for each species/proteome of interest. Is it really worth it? Let me try to answer this question below.
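To give a feel for the scale, here is a rough Python sketch of the "one alignment per proteome sequence" step only. It builds the hhblits command lines without running them. The directory layout (one FASTA file per protein, one folder per species), the function names, and the database name are all hypothetical; `-i`, `-d`, `-oa3m`, `-n` and `-cpu` are standard hhblits options, but check them against your installed version. Turning the resulting A3M files into a searchable HH-suite database is a separate step that is not shown here.

```python
from pathlib import Path
from typing import List


def hhblits_profile_command(query_fasta: Path, reference_db: str, out_dir: Path,
                            iterations: int = 2, cpus: int = 4) -> List[str]:
    """Build (not run) one hhblits call that writes an A3M alignment for one protein."""
    out_a3m = out_dir / (query_fasta.stem + ".a3m")
    return [
        "hhblits",
        "-i", str(query_fasta),   # single-sequence query
        "-d", reference_db,       # large clustered reference database (hypothetical name)
        "-oa3m", str(out_a3m),    # alignment output, one per protein
        "-n", str(iterations),    # number of search iterations
        "-cpu", str(cpus),
    ]


def commands_for_species(species_dirs: List[Path], reference_db: str,
                         out_root: Path) -> List[List[str]]:
    """One command per protein FASTA file found in each species folder."""
    commands = []
    for species_dir in species_dirs:
        out_dir = out_root / species_dir.name
        for query in sorted(species_dir.glob("*.fasta")):
            commands.append(hhblits_profile_command(query, reference_db, out_dir))
    return commands


if __name__ == "__main__":
    example = hhblits_profile_command(
        Path("proteomes/species_a/protein_0001.fasta"),
        "reference_db_prefix",
        Path("a3m/species_a"),
    )
    print(" ".join(example))
```

The number of commands equals the number of proteins across all your proteomes, which is where the compute cost comes from.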

Based on my own experience, also looking at a protein domain ~ 50aa long, with very low sequence conservation, but quite high structural conservation, I concluded the following:

A. It may be better to consider HMMER2 rather than HMMER3 for discovering matches to short, diverse domain sequences, or to try a hybrid approach: https://biologydirect.biomedcentral.com/articles/10.1186/s13062-016-0163-0

B. HHBLITS is extremely computationally intensive. It did let me add a handful of proteins to the list of hundreds that HMMER2, HMMER3 and HHBLITS all found, but for that small gain it may not be worth it. Also, these methods return slightly different domain boundaries. If you need to annotate proteins with domain start-stop coordinates, deciding which tool's results to use for consistency and comparison becomes another headache; public databases commonly use InterProScan and, to some extent, PfamScan.

C. In the past (and probably still), HHBLITS was a top performer for research questions where it was important to find a template for structural modeling; these are low-throughput studies. What I performed, and what you intend to perform, are high-throughput (proteome-scale) studies, which is NOT what HHBLITS was originally intended for.
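For the comparison described in point B, a small Python sketch like the one below can show which hits are shared between tools and how far two tools disagree on a domain's boundaries. The function names, tool labels and protein IDs are placeholders, and measuring overlap relative to the shorter of the two domains is just one possible convention.

```python
from typing import Dict, Set, Tuple

Domain = Tuple[int, int]  # 1-based inclusive (start, end) coordinates


def overlap_fraction(a: Domain, b: Domain) -> float:
    """Overlap length divided by the length of the shorter domain."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    overlap = max(0, end - start + 1)
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return overlap / shorter


def compare_hit_sets(hits_by_tool: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return the hits shared by all tools plus the hits unique to each tool."""
    all_sets = list(hits_by_tool.values())
    shared = set.intersection(*all_sets) if all_sets else set()
    result = {"shared": shared}
    for tool, hits in hits_by_tool.items():
        # Union of everything reported by the other tools.
        others = set().union(*(h for t, h in hits_by_tool.items() if t != tool))
        result[f"only_{tool}"] = hits - others
    return result


if __name__ == "__main__":
    hits = {
        "hmmer3": {"protA", "protB"},
        "hhblits": {"protA", "protB", "protC"},
    }
    print(compare_hit_sets(hits))
    print(overlap_fraction((10, 59), (15, 70)))
```

The "only_..." sets are the ones worth inspecting by hand, since that is where the extra sensitivity (or the extra false positives) shows up.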

In conclusion, this is my warning to you: Stick with HMMER2 or HMMER3 or some such variant, rather than using HHBLITS for protein domain discovery.

score 0 · Answer 2 · 2020-05-01

PS. I am low on computational resources !!

You should not keep secrets from us :D It would be easier to advise you knowing your exact computational resources.

To your first question, the answer is yes in general. In any particular case, however, HHsuite may not be better than HMMER.

To your second question, building a custom HMM database takes time and resources, and I don't know whether that is feasible for you or worth the effort. For example, I have no idea if you are willing or able to spend a few weeks or a month only to find 50 additional proteins on top of the thousand you already know about.

I suggest you start an HHpred search with your seed alignment. One of the files produced by this protocol will be an alignment of all the hits; BLAST those sequences against your proteomes. It should be a good assumption that significant hits in your proteomes are good candidates to be members of the same family. Compare the results with HMMER and see what makes the most sense. This procedure would take considerably less time than building HMMs for your complete proteomes, either via individual searching (as Anand suggested) or via clustering (which is probably better but still time-consuming).
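As a rough illustration of the last step, here is a Python sketch that collects significant proteome hits from BLAST tabular output. It assumes BLAST+ was run with `-outfmt 6` and the default 12 columns (subject ID in column 2, E-value in column 11). The function name and the 1e-5 cut-off are arbitrary choices for the example, not a recommendation.

```python
from typing import Iterable, Set


def significant_subjects(blast_tab_lines: Iterable[str],
                         max_evalue: float = 1e-5) -> Set[str]:
    """Collect subject IDs whose E-value is at or below max_evalue.

    Expects default BLAST+ tabular output (-outfmt 6): qseqid, sseqid, pident,
    length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    subjects = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a default 12-column record
        subject_id = fields[1]
        evalue = float(fields[10])
        if evalue <= max_evalue:
            subjects.add(subject_id)
    return subjects


if __name__ == "__main__":
    example = [
        "hit1\tprotA\t35.0\t50\t30\t1\t1\t50\t10\t59\t1e-10\t60.0",
        "hit1\tprotB\t28.0\t48\t33\t2\t1\t48\t5\t52\t0.5\t25.0",
    ]
    print(significant_subjects(example))
```

The resulting set can then be compared directly against the protein IDs you already have from HMMER.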