Question

Should redundancy be removed before building HMM profiles?


9.4 years ago

Nikhil Chaudhary ▴ 60

Hi community,

I am developing a tool to find homologs of a set of proteins in a metagenome. One of the steps is to take all the sequences of each protein and build a profile model for it using HMMER. I am an undergraduate student and don't know very much about HMMs (I have read the basics, though).

My primary question is whether I should cluster highly similar sequences with CD-HIT to remove redundancy. If so, what would be an ideal identity threshold?

My second question concerns sequences of highly variable length. If the representative sequences of a protein vary widely in length, does that cause any problems? I believe it shouldn't.

I would appreciate any help/guides/comments.

P.S. This is my first time posting here, so please forgive me if I have not followed some guidelines.

Best Regards

sequence alignment genome • 3.2k views


Ram · Answer 1 · 2014-12-02


9.4 years ago

Siva ★ 1.9k

Yes, you should remove redundancy from your sequence dataset before building HMMs. As for the threshold, you might want to check which thresholds Pfam or HHblits use for clustering their sequences. I would use a 50% sequence similarity threshold and require >80% alignment coverage between two sequences.
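As a rough sketch of how the suggested thresholds could be passed to CD-HIT, the Python helper below only assembles the command line. It assumes that the "50% similarity" is expressed as CD-HIT's sequence identity cutoff (`-c`), that the 80% coverage applies to the shorter sequence (`-aS`; `-aL` would constrain the longer one instead), and that the word size (`-n`) follows the ranges given in the CD-HIT user guide. The file names are hypothetical.

```python
import shlex


def cdhit_word_size(identity: float) -> int:
    """Word size (-n) suggested by the CD-HIT user guide for protein identity cutoffs."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("cd-hit does not cluster proteins below 0.4 identity")


def cdhit_command(fasta_in, fasta_out, identity=0.5, coverage=0.8):
    """Build (but do not run) a cd-hit command as a list of arguments."""
    return [
        "cd-hit",
        "-i", fasta_in,            # input FASTA (hypothetical file name)
        "-o", fasta_out,           # output FASTA of cluster representatives
        "-c", str(identity),       # sequence identity threshold
        "-n", str(cdhit_word_size(identity)),
        "-aS", str(coverage),      # alignment must cover this fraction of the shorter sequence
    ]


print(shlex.join(cdhit_command("family.fasta", "family_nr.fasta")))
```

The resulting list can be handed to `subprocess.run` once CD-HIT is installed; changing `identity` to 0.7 or 0.85 switches the word size automatically.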



Thank you for the response. You have cleared up a big doubt. Yesterday I used CD-HIT for this purpose and clustered the sequences at a 0.85 threshold with no other criteria (I plan to check some lower and higher thresholds too). On most test sequences, I got significantly higher bit scores with the clustered data than with the non-clustered data.

My average bit scores are around 300 for the whole dataset and 350 for the clustered sequences.
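One way such averages might be computed, assuming the searches were run with `hmmsearch --tblout`, is to read the full-sequence bit score from the sixth whitespace-separated column of each hit line. The sketch below uses made-up, truncated rows purely for illustration; they are not real results.

```python
def mean_bit_score(tblout_text: str) -> float:
    """Average the full-sequence bit scores in hmmsearch --tblout output."""
    scores = []
    for line in tblout_text.splitlines():
        # Skip blank lines and the '#' header/footer comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # Columns: target, accession, query, accession, E-value, score, bias, ...
        scores.append(float(fields[5]))
    if not scores:
        raise ValueError("no hits found")
    return sum(scores) / len(scores)


# Toy rows (hypothetical and truncated after the bias column).
example = """\
# target name  accession  query name  accession  E-value  score  bias
seq1 - famA - 1e-90 300.0 0.1
seq2 - famA - 1e-110 350.0 0.2
"""
print(mean_bit_score(example))
```

Running the same function on the table from each clustering threshold gives directly comparable averages.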

Many proteins in my database have only 5-10 representative sequences. Is it a good idea to build profiles for such proteins?

Could you also tell me whether the hmmcalibrate function is still available in HMMER 3? It existed in HMMER 2, but I can't find any reference to it in the HMMER 3 docs.

I would be really thankful if you could answer any of these questions. Even if you don't, thanks a lot for the answer to the original question.

updated 4.5 years ago by Ram 43k • written 9.4 years ago by Nikhil Chaudhary ▴ 60


A small number of representative sequences should be fine for building HMMs. Pfam has several families with a relatively small number of sequences in the seed alignment.

I have not used HMMER3. It seems that the new algorithm estimates E-values in a different way and no longer requires calibration.

From Wikipedia:

Further gains in performance are due to a log-likelihood model that requires no calibration for estimating E-values, and allows the more accurate forward scores to be used for computing the significance of a homologous sequence.

And the relevant paper:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2396288/
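If calibration is indeed no longer needed, a build-and-search run in HMMER3 comes down to two commands. The following is a minimal sketch that only assembles them; it assumes a multiple sequence alignment of the clustered family already exists (`hmmbuild` takes an alignment, not unaligned sequences) and that the metagenome has been translated into protein sequences. All file names are hypothetical.

```python
import shlex


def hmmer3_commands(alignment, hmm_out, seq_db, tblout):
    """Return the hmmbuild and hmmsearch argument lists (nothing is executed)."""
    # hmmbuild <hmmfile_out> <msafile>: build a profile from the alignment.
    build = ["hmmbuild", hmm_out, alignment]
    # hmmsearch --tblout <file> <hmmfile> <seqdb>: search and save a per-target table.
    search = ["hmmsearch", "--tblout", tblout, hmm_out, seq_db]
    return [build, search]


for cmd in hmmer3_commands("famA.aln", "famA.hmm", "metagenome_proteins.fasta", "famA.tbl"):
    print(shlex.join(cmd))
```

There is no separate calibration command between the two steps.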

updated 4.5 years ago by Ram 43k • written 9.4 years ago by Siva ★ 1.9k


Thank you mate. Thank you very much. You might not realize it but you have done a great favor.

I will now build profiles for proteins that have a minimum of 5 sequences after clustering. I have checked clustering at thresholds of 0.50, 0.70, 0.85, and 0.90, and the best scores so far are at 0.70 and 0.85.
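The minimum-of-5 rule could be applied with a small filter like the one below. It is only a sketch: the family names and FASTA contents are toy placeholders, and in practice the text would come from the clustered FASTA file of each protein.

```python
def count_fasta_records(fasta_text: str) -> int:
    """Count sequences by counting FASTA header lines (those starting with '>')."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def families_to_model(clustered: dict, min_seqs: int = 5) -> list:
    """Names of families that keep at least min_seqs sequences after clustering."""
    return sorted(
        name for name, text in clustered.items()
        if count_fasta_records(text) >= min_seqs
    )


# Toy input: family name -> clustered FASTA text (hypothetical data).
clustered = {
    "famA": "".join(f">a{i}\nMKT\n" for i in range(6)),
    "famB": "".join(f">b{i}\nMKV\n" for i in range(3)),
}
print(families_to_model(clustered))
```

Only the families returned by `families_to_model` would then go on to alignment and profile building.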