How to use CD-HIT to filter a protein sequence dataset by a certain similarity threshold?
4.5 years ago
opronu • 0

I would like to use CD-HIT to filter my protein sequence dataset at a similarity cut-off of 70%. More precisely, after filtering, every pair of remaining sequences should have a pairwise similarity below 70%.

Does that mean that I should run CD-HIT on my dataset with a similarity threshold of 70% (all other settings at their defaults), and then keep only the "representative" sequences from the resulting cluster files? I have read the guide (http://weizhongli-lab.org/lab-wiki/doku.php?id=cd-hit-user-guide) but I am still not sure how to use the output.

Related to the above, how would CD-HIT handle the scenario below, or similar ones?

  • assuming seq. similarity threshold = 70%
  • sim.(seqA, seqB) = 80%
  • sim.(seqA, seqC) = 90%
  • sim.(seqA, seqD) = 65%
  • sim.(seqA, seqE) = 60%
  • sim.(seqD, seqE) = 50%
  • => here, would the method keep only seqD and seqE in the output as the sequences satisfying the threshold?

Thanks!

4.5 years ago
Mensur Dlakic ★ 30k

Your understanding of the algorithm is correct, except that CD-HIT clusters by sequence identity thresholds, not similarity.
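On the practical question of which output to use: as far as I know, the FASTA file written under the `-o` name already contains only the representative sequences, and the accompanying `.clstr` file lists the members of each cluster with the representative marked by `*`. Below is a minimal Python sketch, not a definitive recipe. The file names (`proteins.fasta`, `nr70`), the helper names, and the sequence lengths in the sample text are made up for illustration; `-n 5` is the word size the user guide pairs with thresholds of 0.7 and above.

```python
import re

def cd_hit_command(fasta_in, out_prefix, identity=0.7):
    """Build a cd-hit argument list (the command is only built here, not run)."""
    # -c is the identity threshold; -n 5 is the word size suggested for -c 0.7-1.0
    return ["cd-hit", "-i", fasta_in, "-o", out_prefix,
            "-c", str(identity), "-n", "5"]

def representatives(clstr_text):
    """Return the IDs of cluster representatives from the text of a .clstr file."""
    reps = []
    for line in clstr_text.splitlines():
        line = line.strip()
        # Member lines look like: "0<TAB>300aa, >seqA... *"; '*' marks the representative
        if line.endswith("*"):
            match = re.search(r">(\S+?)\.\.\.", line)
            if match:
                reps.append(match.group(1))
    return reps

# Hypothetical .clstr content for the scenario in the question (lengths invented)
SAMPLE_CLSTR = """>Cluster 0
0\t300aa, >seqA... *
1\t280aa, >seqB... at 80.00%
2\t250aa, >seqC... at 90.00%
>Cluster 1
0\t200aa, >seqD... *
>Cluster 2
0\t150aa, >seqE... *
"""

cmd = cd_hit_command("proteins.fasta", "nr70")
reps = representatives(SAMPLE_CLSTR)
```

If CD-HIT is installed, `cmd` could be passed to `subprocess.run(cmd, check=True)`; the representatives would then be in `nr70` and the cluster listing in `nr70.clstr`.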

For the pairs in your scenario, the algorithm would do the following (again assuming the values are identities rather than similarities):

  • A and B (80%): throw out the shorter of the two sequences
  • A and C (90%): throw out the shorter of the two sequences
  • A and D (65%): both sequences would be retained
  • A and E (60%): both sequences would be retained
  • D and E (50%): both sequences would be retained

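It is impossible to answer your question completely without knowing the sequence lengths, because CD-HIT processes sequences from longest to shortest and keeps the longest member of each cluster as its representative. It is likely that the longest sequence in the ABC group would be retained, along with both D and E.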
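To make the role of sequence length concrete, here is a rough Python sketch of greedy, longest-first clustering applied to the identities from the question. It only imitates the general idea and is not CD-HIT's actual implementation. The lengths are invented, and any pair not listed in the question (for example B and C) is assumed to fall below the threshold.

```python
def greedy_cluster(lengths, identity, threshold=0.70):
    """Greedy longest-first clustering: each sequence joins the first
    representative it matches at >= threshold, else becomes a new representative."""
    order = sorted(lengths, key=lambda name: -lengths[name])  # longest first
    clusters = {}  # representative -> members (insertion order preserved)
    for seq in order:
        for rep in clusters:
            # Pairs missing from `identity` are treated as 0.0 (an assumption)
            if identity.get(frozenset((seq, rep)), 0.0) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no representative matched
    return clusters

# Pairwise identities taken from the question
identity = {
    frozenset(("A", "B")): 0.80,
    frozenset(("A", "C")): 0.90,
    frozenset(("A", "D")): 0.65,
    frozenset(("A", "E")): 0.60,
    frozenset(("D", "E")): 0.50,
}

# Hypothetical lengths with A as the longest sequence
lengths_a_longest = {"A": 300, "B": 280, "C": 250, "D": 200, "E": 150}
result = greedy_cluster(lengths_a_longest, identity)
kept = list(result)  # the representatives that would be retained
```

With A as the longest sequence, the sketch keeps A, D and E, and B and C are absorbed into A's cluster. If B were the longest instead, A would be absorbed by B, and C would only be compared against the representatives (B, not A), so the outcome would depend on the B-C identity, which the question does not give.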


Thank you Mensur, this is now clear to me!

And yes, you are correct: I should have been referring to identity thresholds here.
