Clustering sequences with low identity
17 months ago
agata88 ▴ 840

Hi all!

I need to cluster sequences at a 30-40% identity threshold. Which tool would you recommend?

I've tried CD-HIT, but it's not recommended for such low identity. I also looked for an option to cluster by identity threshold in MUMmer, kalign and Clustal Omega, but could not find one.

Any help would be much appreciated!

Edit: I have two types of input files to cluster: a plain FASTA file with multiple sequences and an aligned FASTA file (from kalign).

Best, Agata

clustering sequences • 832 views

Other tools that come to mind are USEARCH and VSEARCH (https://drive5.com/usearch/manual/uclust_algo.html), but they work quite similarly to CD-HIT. Because the identity is so low, it would help if you shared your end goal. Others may be able to suggest a better solution than clustering.

These may also be of interest, but you would need to filter and parse the output yourself afterwards:

https://drive5.com/usearch/manual/cmd_allpairs_local.html

https://drive5.com/usearch/manual/cmd_allpairs_global.html


Thank you for your suggestions. I aligned the sequences with kalign and computed a percent identity matrix with clustalo. I saw that my sequences are not very similar to each other and decided to cluster them at 30-40% (which is the mean of the identity matrix). CD-HIT gave me absurd results, and the other programs had no option for setting an identity threshold.

In the meantime I found this question, which is very similar to mine: "Clustering sequence on similarity using percentage identity matrix". I will try the solutions mentioned there.
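For reference, one simple way to cluster directly from a percent identity matrix is single-linkage grouping at a cutoff. The sketch below is only an illustration, not the method from the linked question: the function name `cluster_by_identity`, the sequence names and the toy matrix are all made up, and it assumes the matrix has already been parsed into a list of lists.

```python
def cluster_by_identity(names, identity, threshold):
    """Single-linkage clustering: any pair with identity >= threshold is linked."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if identity[i][j] >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri  # merge the two clusters

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return sorted(clusters.values())


# Toy data (hypothetical names and values, percent identity on a 0-100 scale).
names = ["seqA", "seqB", "seqC", "seqD"]
identity = [
    [100.0, 42.0, 18.0, 15.0],
    [42.0, 100.0, 20.0, 12.0],
    [18.0, 20.0, 100.0, 35.0],
    [15.0, 12.0, 35.0, 100.0],
]

clusters = cluster_by_identity(names, identity, 30.0)
print(clusters)
```

Keep in mind that single linkage tends to chain: two sequences can end up in the same cluster without sharing 30% identity themselves, as long as a path of linked neighbours connects them. Average or complete linkage on the same matrix would be stricter.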

Best, Agata


How about generating tlsh hashes of your sequences and then using the xref command to get pairwise distances? You can then cluster the resulting distance matrix (my preferred clustering algorithm for small to medium-sized sets is affinity propagation). Depending on how long your sequences are, it may be more sensible to create (perhaps exhaustive) mash sketches, calculate their pairwise distances with the dist command, and again cluster the resulting distance matrix with affinity propagation or something else.
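As a rough sketch of the step between `mash dist` and the clustering, the snippet below turns the tab-separated output (reference ID, query ID, distance, p-value, shared hashes) into a square distance matrix. The helper name `read_mash_dist` and the example file names are made up for illustration, and filling missing pairs with a distance of 1.0 is an assumption of this sketch.

```python
import io


def read_mash_dist(handle):
    """Parse `mash dist` tabular output into (names, square distance matrix)."""
    dist = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Only the first three columns are needed: reference, query, distance.
        ref, query, d = line.split("\t")[:3]
        dist.setdefault(ref, {})[query] = float(d)
        dist.setdefault(query, {})[ref] = float(d)  # store both directions

    names = sorted(dist)
    # Assumption: a pair missing from the output gets the maximum distance 1.0.
    matrix = [
        [0.0 if a == b else dist[a].get(b, 1.0) for b in names]
        for a in names
    ]
    return names, matrix


# Hypothetical output for two inputs, a.fa and b.fa.
example = (
    "a.fa\ta.fa\t0\t0\t1000/1000\n"
    "a.fa\tb.fa\t0.25\t1e-10\t12/1000\n"
    "b.fa\tb.fa\t0\t0\t1000/1000\n"
)
names, matrix = read_mash_dist(io.StringIO(example))
print(names)
print(matrix)
```

If you then use scikit-learn's `AffinityPropagation` with `affinity="precomputed"`, note that it expects similarities rather than distances, so the matrix would need to be negated (or otherwise converted) first.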

17 months ago
Mensur Dlakic ★ 20k

I have used MMseqs2 to cluster down to 20% identity. It is multithreaded and will be reasonably fast even on large databases. If you have a really large database (> 10-50 million sequences), I suggest stepping down through identity thresholds, as is usually done with CD-HIT.
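For anyone landing here later, a minimal sketch of what such a run could look like with the `easy-cluster` workflow is below. The file names (`sequences.fasta`, `clusterRes`, `tmp`), the helper name and the parameter values are placeholders chosen for illustration, not the settings actually used above; the snippet only builds and prints the command and does not run it.

```python
import shlex


def mmseqs_easy_cluster_cmd(fasta, out_prefix, tmp_dir,
                            min_seq_id=0.3, coverage=0.8, threads=4):
    """Build an `mmseqs easy-cluster` command line as a list of arguments."""
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity (0-1)
        "-c", str(coverage),              # minimum alignment coverage (0-1)
        "--threads", str(threads),
    ]


# Hypothetical input and output names; adjust to your own files.
cmd = mmseqs_easy_cluster_cmd("sequences.fasta", "clusterRes", "tmp")
print(shlex.join(cmd))

# To actually run it (requires MMseqs2 on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

With these names, the cluster assignments should end up in `clusterRes_cluster.tsv` as two columns (representative, member). At very low identity it may be worth checking the MMseqs2 documentation for the sensitivity and coverage-mode options as well.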


Thank you! MMseqs2 did the job!

Best, Agata