Question: "Protein search using Blastp" vs "protein search using hmm profile"
5.8 years ago, dago (2.6k) wrote:

I have a list of proteins that define a family of enzymes.

I want to check whether these enzymes are present in a genome I am working with. I therefore applied two approaches:

  1. Run blastp with the above-mentioned sequences against the genome
  2. Create an HMM profile for the proteins and use hmmscan to search for them in the genome
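
The two approaches above can be sketched as command lines. This is only a minimal sketch: all file names are hypothetical placeholders, the E-value cutoff is an arbitrary assumption, and `hmmsearch` is swapped in for `hmmscan` because here the profile is the query and the proteome is the target (`hmmscan` works the other way round, with sequences queried against a pressed profile database). The function only builds the argument lists; actually running them assumes BLAST+ and HMMER are installed.

```python
import shlex


def build_search_commands(family_fasta, family_alignment, proteome_fasta,
                          evalue="1e-5"):
    """Return argv lists for a BLASTP search and a profile-HMM search.

    All file names (inputs and outputs) are hypothetical placeholders.
    """
    blast_db = "proteome_db"   # hypothetical BLAST database name
    hmm_file = "family.hmm"    # hypothetical profile file name
    # Standard 12 tabular columns plus the query length (qlen).
    outfmt = ("6 qseqid sseqid pident length mismatch gapopen "
              "qstart qend sstart send evalue bitscore qlen")
    return {
        # Approach 1: index the predicted proteome, then run blastp.
        "makeblastdb": ["makeblastdb", "-in", proteome_fasta,
                        "-dbtype", "prot", "-out", blast_db],
        "blastp": ["blastp", "-query", family_fasta, "-db", blast_db,
                   "-evalue", evalue, "-outfmt", outfmt,
                   "-out", "blast_hits.tsv"],
        # Approach 2: build a profile from a multiple alignment, then search.
        "hmmbuild": ["hmmbuild", hmm_file, family_alignment],
        "hmmsearch": ["hmmsearch", "-E", evalue,
                      "--domtblout", "hmm_hits.domtbl",
                      hmm_file, proteome_fasta],
    }


commands = build_search_commands("family.faa", "family.sto", "proteome.faa")
for name, argv in commands.items():
    print(name, "->", shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` once the tools and input files exist.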

In both cases I get a variable number of hits.

However, there is something that I am missing from the biological point of view.

BLAST's local alignment suggests to me that the hits could resemble the query proteins in regions that are not directly involved in the enzymatic reaction. The hits I obtained could therefore, in theory, lack a function similar to that of the reference.

By creating an HMM profile instead, I consider only the regions of the proteins that are highly conserved and likely crucial for the enzyme's function. This approach should therefore give hits that are more likely to share function with the reference proteins.

Am I missing something, or is my reasoning correct?

modified 5.8 years ago by Siva (1.7k) • written 5.8 years ago by dago (2.6k)
5.8 years ago, Siva (1.7k, United States) wrote:

Yes, your reasoning is correct, especially for multicellular eukaryotes, where the majority of proteins are multi-domain proteins. If your query protein contains a promiscuous domain, BLAST may pick up functionally unrelated proteins that share only that promiscuous domain. This gets worse with PSI-BLAST. One way to handle the problem is to restrict the match length (i.e. the subject should match at least 80% of the query protein). Even better would be your second choice: using profile HMMs of only the domain of interest.
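
The match-length restriction could be applied as a post-filter on tabular BLAST output. A minimal sketch, assuming the search was run with the 12 standard tabular columns plus `qlen` as a 13th column; the function name and the sample rows are made up for illustration. Note that coverage is computed per HSP here, so hits split across several HSPs are not merged.

```python
def filter_by_query_coverage(blast_lines, min_coverage=0.8):
    """Keep hits whose aligned query span covers >= min_coverage of the query.

    Expects tab-separated rows with columns:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore qlen
    """
    kept = []
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qstart, qend, qlen = int(fields[6]), int(fields[7]), int(fields[12])
        coverage = (qend - qstart + 1) / qlen  # per-HSP query coverage
        if coverage >= min_coverage:
            kept.append((fields[0], fields[1], coverage))
    return kept


# Made-up example rows: the first hit spans most of the query,
# the second matches only a short region of it.
example_rows = [
    "enzA\tprot1\t45.0\t291\t150\t5\t5\t295\t10\t300\t1e-50\t250\t300",
    "enzA\tprot2\t60.0\t81\t30\t2\t200\t280\t50\t130\t1e-20\t100\t300",
]
for query, subject, cov in filter_by_query_coverage(example_rows):
    print(query, subject, f"{cov:.2f}")
```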

written 5.8 years ago by Siva (1.7k)

Ideally, if there is enough homology between query and subject, you won't have much of an issue with similar domains leading to false-positive results.

However, hmmscan won't fix this if you're only searching for a single domain or if the conserved region you select shows up in other places. Depending on what your enzymatic domain is, it could be present in other proteins. In other words, the opposite of the case described by OP could be true: the enzymatic parts match but the other regions don't.

HMMs are a more advanced model, but the general concept is the same as in PSI-BLAST. You're basically searching with a 'fuzzy' representation of a group of sequences rather than a single sequence.
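
To make the 'fuzzy' representation concrete, here is a toy position-specific scoring matrix built from a tiny ungapped alignment. It is a deliberately simplified sketch, not what PSI-BLAST or HMMER actually compute: it assumes equal-length sequences without gaps, a uniform background, and a flat pseudocount, and it has no insert or delete states as a real profile HMM would. The sequences are invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_seqs, pseudocount=1.0):
    """Per-column log-odds scores (log2) against a uniform background."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_seqs)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, seq):
    """Sum the column scores of an ungapped, same-length candidate."""
    return sum(column[aa] for column, aa in zip(profile, seq))


family = ["HDSGK", "HDTGK", "HESGR"]  # invented motif alignment
profile = build_profile(family)
# "HETGR" is not in the family but mixes residues seen in each column,
# so it scores well; an unrelated peptide scores below zero.
print(score(profile, "HETGR"), score(profile, "WWWWW"))
```

A single-sequence query would score "HETGR" by pairwise similarity to one family member only, whereas the profile rewards any residue observed at that column across the family.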

I think your best bet would be to maximize alignment length or confirm each BLAST/HMM hit with a global aligner (e.g. clustalo).
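
As a rough illustration of why a global check helps, the sketch below scores the same pair of sequences both locally (Smith-Waterman style) and globally (Needleman-Wunsch style). The scoring scheme (match +1, mismatch -1, linear gap -2) and the sequences are arbitrary assumptions chosen for readability; a real aligner would use a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a vs b by dynamic programming.

    local=False: global score (both sequences aligned end to end).
    local=True:  local score (best-scoring pair of substrings).
    """
    cols = len(b) + 1
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart
                best = max(best, cell)
            curr[j] = cell
        prev = curr
    return best if local else prev[-1]


query = "MKTAYIAKQRHDSGKLLVEW"    # invented query carrying an HDSGK motif
subject = "PPPPPHDSGKPPPPPPPPPP"  # invented hit sharing only that motif
print("local :", alignment_score(query, subject, local=True))
print("global:", alignment_score(query, subject))
```

The shared motif gives a positive local score, while the global score is negative because the rest of the two sequences does not match.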

written 5.8 years ago by pld (4.9k)

I agree with most of what you wrote. There are at least two kinds of false positives in OP's case:

  1. Proteins that do not have OP's domain of interest but only a shared promiscuous domain
  2. Proteins that have OP's domain of interest but are not necessarily homologs

OP stated that using BLAST might produce the first kind of false positives but that using hmmscan will avoid them. This is correct. hmmscan with a profile HMM built from "the region in the proteins that are highly conserved and that are likely crucial for the functionality of the enzyme" will completely avoid the first kind of false positives. You are referring to the second kind of false positives. Further analyses, such as comparing domain architecture, global alignment, or phylogenetic analysis, are needed to eliminate these.
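
Comparing domain architecture can be as simple as checking whether a candidate carries the same domains in the same order as a reference protein. A minimal sketch, assuming the domain hits have already been parsed into `(name, start, end)` tuples, for example from a domain table; the domain names and coordinates are invented.

```python
def architecture(domain_hits):
    """Return the domain names ordered by start coordinate.

    domain_hits: iterable of (name, start, end) tuples for one protein.
    """
    return tuple(name for name, start, end in
                 sorted(domain_hits, key=lambda hit: hit[1]))


def same_architecture(reference_hits, candidate_hits):
    """True if both proteins carry the same domains in the same order."""
    return architecture(reference_hits) == architecture(candidate_hits)


reference = [("Kinase", 120, 380), ("SH3", 10, 60)]    # invented
candidate_a = [("SH3", 5, 58), ("Kinase", 110, 370)]   # same order
candidate_b = [("Kinase", 20, 280)]                    # domain of interest only
print(same_architecture(reference, candidate_a))
print(same_architecture(reference, candidate_b))
```

This strict equality test is only one possible criterion; tolerating extra or missing accessory domains would need a looser comparison.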

Though PSSM-based and profile-HMM-based searches are similar in concept, PSI-BLAST is worse for the first kind of false positives because of its iterative searches. A PSSM generated from a promiscuous domain is going to pick up all kinds of unrelated hits.

written 5.8 years ago by Siva (1.7k)

Hi Siva, I came across this thread while searching for the difference between HMM-based alignment and traditional local alignment such as BLAST. As I understand it, both the BLAST and HMM approaches can only return alignments optimized for 'local similarity'; neither can tell whether the similar regions are more likely to be a homolog or a promiscuous domain. So why should we prefer the HMM approach over BLAST? I am guessing that I missed something about their difference.

written 13 months ago by CY (550)