Question: Clustering sequences with low identity
agata88 (Poland) wrote 8 days ago:

Hi all!

I need to cluster sequences at a 30-40% identity threshold. Which tool would you recommend?

I've tried CD-HIT, but it's not recommended for such low identity. I also looked for a way to cluster by identity threshold in MUMmer, kalign and clustalO, but could not find one.

Any help would be much appreciated!

Edit: I have two types of input files to cluster: a basic FASTA file with multiple sequences and an aligned FASTA file (from kalign).

Best, Agata


Other tools that come to mind are USEARCH and VSEARCH (https://drive5.com/usearch/manual/uclust_algo.html), but they work much like CD-HIT. Because the identity is so low, it would help if you shared your end goal. Maybe others can suggest a better solution than clustering.

These commands may also be of interest, but you will need to filter and parse their output yourself afterwards:

https://drive5.com/usearch/manual/cmd_allpairs_local.html

https://drive5.com/usearch/manual/cmd_allpairs_global.html
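
The filtering/parsing step could be sketched roughly as follows in Python. This assumes the pairwise results were exported as a tab-separated table with query, target and percent identity in the first three columns; the function names, the example rows and the 30% cutoff are illustrative only. Note that this is plain single-linkage grouping: two sequences land in the same cluster if a chain of pairs above the cutoff connects them, which is not the same as the greedy clustering that CD-HIT or UCLUST perform.

```python
from collections import defaultdict


def parse_pairs(lines):
    """Yield (query, target, percent_identity) from tab-separated rows."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        query, target, identity = line.rstrip("\n").split("\t")[:3]
        yield query, target, float(identity)


def single_linkage_clusters(pairs, min_identity):
    """Group IDs that are connected by pairs with identity >= min_identity."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, target, identity in pairs:
        root_q, root_t = find(query), find(target)  # registers both IDs
        if identity >= min_identity and root_q != root_t:
            parent[root_q] = root_t

    clusters = defaultdict(list)
    for seq_id in parent:
        clusters[find(seq_id)].append(seq_id)
    # Largest clusters first, members sorted for stable output
    return sorted((sorted(m) for m in clusters.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical example rows: query, target, percent identity
example = [
    "seqA\tseqB\t42.0",
    "seqB\tseqC\t35.5",
    "seqC\tseqD\t12.0",
    "seqD\tseqE\t31.0",
]
clusters = single_linkage_clusters(parse_pairs(example), min_identity=30.0)
for members in clusters:
    print("\t".join(members))
```

Sequences that only appear in pairs below the cutoff still show up, as singleton clusters.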

written 8 days ago by gb

Thank you for your suggestions. I aligned the sequences with kalign and computed a percent identity matrix with clustalo. The sequences turned out not to be very similar to each other, so I decided to cluster them at 30-40% identity (which is the mean of the identity matrix). CD-HIT gave me absurd results, and the other programs had no option for setting an identity threshold.

In the meantime I found this question, which is very similar to mine: "Clustering sequence on similarity using percentage identity matrix". I will try the solutions mentioned there.

Best, Agata

written 8 days ago by agata88

How about generating tlsh hashes of your sequences and then using the xref command to get pairwise distances? You can then cluster the resulting distance matrix (my preferred clustering algorithm for small to medium-sized sets is affinity propagation). Depending on how long your sequences are, it may be more sensible to create (perhaps exhaustive) mash sketches, calculate their pairwise distances with the dist command, and again cluster the resulting distance matrix with affinity propagation or something else.
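
For the mash route, a minimal Python sketch of turning the pairwise output into a matrix might look like this. It assumes the tab-separated `mash dist` layout (reference, query, distance, p-value, shared hashes); the function name, the example rows and the choice to fill missing pairs with 1.0 are illustrative assumptions, not something prescribed by mash.

```python
def mash_dist_to_matrix(lines):
    """Build a symmetric distance matrix from `mash dist` style rows."""
    distances = {}
    names = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        ref, query, dist = fields[0], fields[1], float(fields[2])
        names.update((ref, query))
        distances[(ref, query)] = dist
        distances[(query, ref)] = dist
    order = sorted(names)
    # Assumption: pairs missing from the input get the maximum distance 1.0
    matrix = [
        [0.0 if a == b else distances.get((a, b), 1.0) for b in order]
        for a in order
    ]
    return order, matrix


# Hypothetical example rows in the assumed five-column layout
rows = [
    "seq1\tseq2\t0.05\t0\t800/1000",
    "seq1\tseq3\t0.30\t1e-10\t20/1000",
    "seq2\tseq3\t0.28\t1e-12\t25/1000",
]
order, matrix = mash_dist_to_matrix(rows)

# Many clustering implementations want similarities rather than distances;
# flipping the sign is one simple conversion.
similarity = [[-d for d in row] for row in matrix]
```

The `similarity` matrix can then be handed to a clustering library. scikit-learn's `AffinityPropagation(affinity="precomputed")`, for instance, expects similarities rather than distances, which is why the sign is flipped here.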

written 8 days ago by 5heikki
Mensur Dlakic (USA) wrote 8 days ago:

I have used MMseqs2 to cluster down to 20% identity. It is multithreaded and reasonably fast even on large databases. If you have a really large database (more than 10-50 million sequences), I suggest lowering the identity threshold in steps, as is commonly done with CD-HIT.
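
As a rough illustration (not taken from the answer itself), an MMseqs2 run at 30% identity could be driven from Python along these lines. The `easy-cluster` workflow and the `--min-seq-id`, `-c` and `--threads` options are part of MMseqs2; the file names, the 0.8 coverage value, the thread count and the helper function names are assumptions chosen for the example.

```python
import csv
import io
from collections import defaultdict


def mmseqs_easy_cluster_cmd(fasta, out_prefix, tmp_dir,
                            min_seq_id=0.3, coverage=0.8, threads=4):
    """Return the argument list for an `mmseqs easy-cluster` run."""
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # identity threshold as a fraction
        "-c", str(coverage),              # minimum alignment coverage
        "--threads", str(threads),
    ]


def read_cluster_tsv(handle):
    """Parse a two-column cluster table: representative, member."""
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank lines
        clusters[row[0]].append(row[1])
    return dict(clusters)


cmd = mmseqs_easy_cluster_cmd("sequences.fasta", "clusterRes", "tmp")
print(" ".join(cmd))
# To actually run it (requires mmseqs on PATH):
#   import subprocess; subprocess.run(cmd, check=True)

# Hypothetical content of the resulting clusterRes_cluster.tsv
example_tsv = io.StringIO("seqA\tseqA\nseqA\tseqB\nseqC\tseqC\n")
clusters = read_cluster_tsv(example_tsv)
```

`easy-cluster` writes a `<prefix>_cluster.tsv` file listing each cluster representative next to its members, which is what `read_cluster_tsv` is meant to read.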


Thank you! MMseqs2 did the job!

Best, Agata

written 7 days ago by agata88