Question: Should redundancy be removed before building HMM profiles?
Nikhil Chaudhary (60) wrote 6.2 years ago:

Hi community,

I am developing a tool to find homologs of a set of proteins in a metagenome. One of the steps is to take all the sequences of each protein and build a profile for it using HMMER. I am an undergrad student and don't know very much about HMMs (I have read the basics though).

My primary question is whether I should cluster the highly similar sequences using CD-HIT to remove redundancy. Also, if I should, what would be an ideal threshold?

Another question is about sequences of highly variable length. If the representative sequences of a protein vary a lot in length, does that cause any problems? I believe it shouldn't.

I would appreciate any help/guides/comments. 

P.S. It is my first time posting here, so please forgive me if I have not followed some guidelines.

Best Regards

alignment sequence genome • 2.4k views
modified 6.2 years ago by Siva (1.7k) • written 6.2 years ago by Nikhil Chaudhary (60)
Siva (1.7k, United States) wrote 6.2 years ago:

Yes, you should remove redundancy in your sequence dataset before building HMMs. As for the threshold, you might want to check what threshold Pfam or HHblits is using for clustering the sequences. I would use a 50% sequence similarity threshold and >80% alignment coverage between two sequences.
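
As a rough illustration of that suggestion, here is a minimal Python sketch that assembles a CD-HIT command line for a 50% identity threshold and 80% alignment coverage. The file names are hypothetical. Two things are my assumptions: reading "alignment coverage between two sequences" as coverage of both the longer (`-aL`) and the shorter (`-aS`) sequence, and treating the similarity threshold as CD-HIT's identity threshold (`-c`). The word size (`-n`) follows the ranges given in the CD-HIT user guide.

```python
import shlex


def word_size_for(identity: float) -> int:
    """Word length (-n) the CD-HIT user guide pairs with an identity threshold (-c)."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("identity below 0.4 is outside the cd-hit word-size table")


def cd_hit_command(fasta_in: str, fasta_out: str,
                   identity: float = 0.5, coverage: float = 0.8) -> list[str]:
    """Build (but do not run) a cd-hit call; pass the list to subprocess.run to execute it."""
    return [
        "cd-hit",
        "-i", fasta_in,                 # input FASTA with all sequences of one protein
        "-o", fasta_out,                # output FASTA with cluster representatives
        "-c", f"{identity:.2f}",        # sequence identity threshold
        "-n", str(word_size_for(identity)),
        "-aS", f"{coverage:.2f}",       # alignment must cover this fraction of the shorter sequence
        "-aL", f"{coverage:.2f}",       # ... and of the longer sequence (assumed interpretation)
    ]


# Hypothetical file names, shown only to illustrate the resulting command line.
print(shlex.join(cd_hit_command("proteinA.fasta", "proteinA_nr50.fasta")))
```

The same function can be reused with `identity=0.85` or any other threshold to compare settings.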

written 6.2 years ago by Siva (1.7k)

Thank you for the response. You have cleared up a big doubt. Yesterday I used CD-HIT for this purpose and clustered the sequences at a 0.85 threshold with no other criteria (I plan to check some lower and higher thresholds too). On most test sequences, I got significantly higher bit scores with the clustered data than with the non-clustered data.

My average bit scores are around 300 for the whole data set and 350 for the clustered sequences.
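
For anyone repeating this comparison, a minimal sketch of how such averages could be computed from `hmmsearch --tblout` tables follows. It assumes the standard tabular layout, in which the full-sequence bit score is the sixth whitespace-separated column; the function names and the sample lines are made up for illustration.

```python
from statistics import mean


def full_sequence_scores(tblout_lines):
    """Collect the full-sequence bit score of every hit in an hmmsearch --tblout table."""
    scores = []
    for line in tblout_lines:
        # Skip blank lines and the '#' header/footer comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # Columns: target, accession, query, accession, E-value, score, bias, ...
        scores.append(float(fields[5]))
    return scores


def mean_bit_score(tblout_lines):
    """Average full-sequence bit score, or None when the table has no hits."""
    scores = full_sequence_scores(tblout_lines)
    return mean(scores) if scores else None


# Made-up example rows in tblout layout (not real results).
example = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "seq1 - profA - 1e-90 300.0 0.1 2e-90 299.5 0.1 1.0 1 0 0 1 1 1 1 hypothetical protein",
    "seq2 - profA - 1e-105 350.0 0.0 3e-105 349.0 0.0 1.0 1 0 0 1 1 1 1 hypothetical protein",
]
print(mean_bit_score(example))
```

With real data, the lines would come from `open("hits.tbl")` for each of the two searches (clustered and non-clustered profiles).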

Many proteins in my database have only 5-10 representative sequences. Is it a good idea to build profiles for such proteins?

Can you also tell me whether hmmcalibrate is still available in HMMER 3? It was in HMMER 2, but I can't find any reference to it in the HMMER 3 docs.

I would be really thankful if you could answer any of these questions. Even if you don't, I would still like to thank you for the answer to the original question.

modified 16 months ago by Ram (32k) • written 6.2 years ago by Nikhil Chaudhary (60)

A small number of representative sequences should be fine for building HMMs. Pfam has several families with a relatively small number of sequences in the seed alignment.

I have not used HMMER3. It seems that the new algorithm estimates E-values in a different way and no longer requires calibration.

From Wikipedia:

Further gains in performance are due to a log-likelihood model that requires no calibration for estimating E-values, and allows the more accurate forward scores to be used for computing the significance of a homologous sequence.
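
If that is right, a HMMER3 run would go straight from `hmmbuild` to `hmmsearch` with no calibration step in between. A minimal sketch of the two calls, with hypothetical file names, could look like this (the commands are only assembled here, not executed):

```python
def hmmer3_commands(alignment: str, profile: str, database: str, table: str):
    """Return the two HMMER3 calls for building a profile and searching with it.

    Note there is no hmmcalibrate call between them.
    """
    # hmmbuild takes the output profile first, then the multiple sequence alignment.
    build = ["hmmbuild", profile, alignment]
    # hmmsearch takes the profile, then the sequence database; --tblout saves a per-hit table.
    search = ["hmmsearch", "--tblout", table, profile, database]
    return [build, search]


# Hypothetical file names, for illustration only.
for command in hmmer3_commands("proteinA.sto", "proteinA.hmm", "metagenome.faa", "proteinA_hits.tbl"):
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once HMMER is installed.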

And the relevant paper:

modified 16 months ago by Ram (32k) • written 6.2 years ago by Siva (1.7k)

Thank you mate. Thank you very much. You might not realize it but you have done a great favor.

I will now build profiles only for proteins that have at least 5 sequences left after clustering. I have checked clustering at thresholds of 0.50, 0.70, 0.85 and 0.90, and the best scores so far are at 0.70 and 0.85.
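
The minimum-size filter described above can be sketched in a few lines of Python. The names and the in-memory input are hypothetical; in practice the FASTA lines would be read from the CD-HIT output file of each protein. The surviving representatives would still need to be aligned before `hmmbuild` can use them, since `hmmbuild` takes a multiple sequence alignment as input.

```python
MIN_SEQUENCES = 5  # minimum number of cluster representatives, as stated in the post


def count_fasta_records(lines):
    """Count sequences in FASTA text by counting header lines starting with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def proteins_to_model(clustered, min_sequences=MIN_SEQUENCES):
    """Names of proteins with enough representatives left after clustering.

    `clustered` maps a protein name to the lines of its clustered FASTA file.
    """
    return sorted(
        name for name, lines in clustered.items()
        if count_fasta_records(lines) >= min_sequences
    )


# Toy input: protA keeps 5 representatives, protB only 2.
clustered = {
    "protA": [">s1", "MKV", ">s2", "MKL", ">s3", "MRV", ">s4", "MKI", ">s5", "MQV"],
    "protB": [">s1", "MAA", ">s2", "MAG"],
}
print(proteins_to_model(clustered))
```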

written 6.2 years ago by Nikhil Chaudhary (60)

You are welcome. Glad to help.

written 6.2 years ago by Siva (1.7k)