Question

Clustering sequences with low identity

3.4 years ago

agata88 ▴ 870

Hi all!

I need to cluster sequences at a 30-40% identity threshold. Which tool would you recommend?

I've tried CD-HIT, but it's not recommended for such low identity. I also looked for an identity-threshold clustering option in MUMmer, kalign and Clustal Omega, but could not find one.

Any help would be much appreciated!

Edit: I have two types of input files to cluster: a plain FASTA file with multiple sequences and an aligned FASTA file (from kalign).

Best, Agata

clustering sequences • 1.8k views

ADD COMMENT • link updated 3.4 years ago by Mensur Dlakic ★ 27k • written 3.4 years ago by agata88 ▴ 870

Other tools that come to mind are USEARCH and VSEARCH (https://drive5.com/usearch/manual/uclust_algo.html), but they work much like CD-HIT. Because the identity is so low, it would help if you shared your end goal; others may be able to suggest a better solution than clustering.

These all-vs-all pairwise alignment commands may also be of interest, but you would need to filter and parse the output yourself:

https://drive5.com/usearch/manual/cmd_allpairs_local.html

https://drive5.com/usearch/manual/cmd_allpairs_global.html

ADD REPLY • link 3.4 years ago by gb ★ 2.2k

Thank you for your suggestions. I aligned the sequences with kalign and computed a percent identity matrix with Clustal Omega. I saw that my sequences are not very similar to each other and decided to cluster them at 30-40% identity (which is the mean of the identity matrix). CD-HIT gave me absurd results, and the other programs had no option for setting an identity threshold.

In the meantime I found a question that is very similar to mine: "Clustering sequence on similarity using percentage identity matrix". I will try the solutions mentioned there.
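For readers who want to try the idea of clustering directly from an aligned FASTA file, here is a minimal sketch in plain Python. It is not the method from the linked question. It assumes identity is counted over alignment columns where neither sequence has a gap (tools differ on this denominator, so the values may not match a Clustal Omega matrix), and it assumes single-linkage grouping, where two sequences share a cluster if any chain of pairs at or above the threshold connects them. The function names and the toy alignment are made up for illustration.

```python
from itertools import combinations


def read_fasta(text):
    """Parse FASTA text into a dict of {name: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def percent_identity(a, b):
    """Identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def single_linkage_clusters(aligned, threshold):
    """Group sequences linked by any pair at or above the threshold."""
    parent = {name: name for name in aligned}

    def find(n):
        # Follow parent links to the cluster root, shortening the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(aligned, 2):
        if percent_identity(aligned[a], aligned[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in aligned:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy alignment (hypothetical data, all rows the same length).
toy = read_fasta(">s1\nACDEFGHIK-\n>s2\nACDEFGHLKM\n>s3\nWWWWWYYYYY\n")
print(single_linkage_clusters(toy, threshold=35.0))
```

Single linkage tends to chain distant sequences together at low thresholds, so the resulting groups are worth inspecting by hand; the all-pairs loop is also quadratic and only practical for small sets.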

Best, Agata

ADD REPLY • link 3.4 years ago by agata88 ▴ 870

How about generating tlsh hashes of your sequences and then using the xref command to get pairwise distances? You can then cluster the resulting distance matrix (my preferred clustering algorithm for small to medium-sized sets is affinity propagation). Depending on how long your sequences are, it may be more sensible to create (perhaps exhaustive) mash sketches, calculate their pairwise distances with the dist command, and again cluster the resulting distance matrix with affinity propagation or something else.
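To make the alignment-free distance idea concrete, here is a small sketch in plain Python. It is not mash or tlsh: it assumes exact k-mer sets instead of MinHash sketches, and applies the Mash distance formula, D = -(1/k) ln(2j / (1 + j)), to their Jaccard index j, clamping the result to [0, 1]. The function names, the k value and the toy sequences are hypothetical.

```python
import math


def kmers(seq, k):
    """Return the set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_style_distance(a, b, k=3):
    """Mash distance formula applied to an exact k-mer Jaccard index."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 1.0  # sequences shorter than k: treat as maximally distant
    j = len(ka & kb) / len(union)
    if j == 0.0:
        return 1.0  # no shared k-mers
    d = -math.log(2.0 * j / (1.0 + j)) / k
    return min(1.0, max(0.0, d))  # clamped to [0, 1] in this sketch


def distance_matrix(seqs, k=3):
    """Return sorted names and the symmetric all-vs-all distance matrix."""
    names = sorted(seqs)
    matrix = [[mash_style_distance(seqs[a], seqs[b], k) for b in names]
              for a in names]
    return names, matrix


# Toy input (hypothetical sequences).
toy = {"s1": "ACDEFGHIK", "s2": "ACDEFGHLK", "s3": "WWWWWYYYY"}
names, matrix = distance_matrix(toy)
for name, row in zip(names, matrix):
    print(name, [round(d, 3) for d in row])
```

A matrix like this can be handed to any clustering routine that accepts precomputed values. Note that affinity propagation implementations such as scikit-learn's `AffinityPropagation(affinity="precomputed")` expect similarities, so the distances would need to be converted first, for example by negating them.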

ADD REPLY • link 3.4 years ago by 5heikki 11k

score 3 · Accepted Answer · 2021-02-22

3.4 years ago

Mensur Dlakic ★ 27k

I used MMseqs2 to cluster down to 20% identity. It is multithreaded and reasonably fast even on large databases. If you have a really large database (more than 10-50 million sequences), I suggest lowering the identity threshold stepwise, as you would with CD-HIT.
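As a rough illustration of what such a run could look like (not necessarily the exact command used for this answer), the Python sketch below builds an `mmseqs easy-cluster` call with `--min-seq-id`, `-c` and `--threads`, and reads the two-column cluster table (representative, member) that easy-cluster writes as `<prefix>_cluster.tsv`. The file names, the 0.3 identity, the 0.8 coverage and the thread count are assumptions to adjust for your data; the helper function names are hypothetical.

```python
import os
import shutil
import subprocess
from collections import defaultdict


def build_easy_cluster_cmd(fasta, out_prefix, tmp_dir,
                           min_seq_id=0.3, coverage=0.8, threads=4):
    """Assemble an 'mmseqs easy-cluster' command line as a list."""
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity (0-1)
        "-c", str(coverage),              # minimum alignment coverage (0-1)
        "--threads", str(threads),
    ]


def read_cluster_tsv(path):
    """Map each representative ID to the list of its cluster members."""
    clusters = defaultdict(list)
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            representative, member = line.split("\t")[:2]
            clusters[representative].append(member)
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical input and output names; replace with your own.
    cmd = build_easy_cluster_cmd("sequences.fasta", "clusterRes", "tmp")
    print(" ".join(cmd))
    # Only run when MMseqs2 is installed and the input file exists.
    if shutil.which("mmseqs") and os.path.exists("sequences.fasta"):
        subprocess.run(cmd, check=True)
        clusters = read_cluster_tsv("clusterRes_cluster.tsv")
        print(len(clusters), "clusters")
```

For the stepwise approach on very large sets, one option is to cluster at a higher identity first and then rerun on the resulting representative sequences with a lower `--min-seq-id`.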

ADD COMMENT • link 3.4 years ago by Mensur Dlakic ★ 27k

Thank you! MMseqs2 did the job!

Best, Agata

ADD REPLY • link 3.4 years ago by agata88 ▴ 870