Question: Cluster RNA sequences from fasta alignment by identity threshold
I have large alignments of RNA sequences (16 to 200 thousand sequences) and I need to cluster them by an identity threshold. Basically, what I want to do is:

- compute the distribution of pairwise sequence identity for these alignments;
- once I know this distribution, cluster the sequences by an identity threshold, for example by creating files containing the sequences from my current alignment that are at least 50% identical, at least 60% identical, and so on.

Just to clarify, I define the identity of two sequences as the fraction of aligned positions at which their nucleotides are identical. For example:

seq1 ATA
seq2 GTG

seq1 and seq2 are 33.3% identical.
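A minimal Python sketch of the identity definition above, assuming both sequences are rows of the same alignment and therefore have equal length. The function name `percent_identity` is hypothetical, and gap characters are treated like any other symbol here, which may or may not be what you want for real alignments.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of alignment columns in which both sequences have the same character."""
    if len(a) != len(b):
        raise ValueError("sequences must be rows of the same alignment (equal length)")
    if not a:
        return 0.0
    # Compare column by column, ignoring case.
    # Assumption: a gap ('-') is compared like an ordinary character.
    matches = sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)
    return 100.0 * matches / len(a)


print(round(percent_identity("ATA", "GTG"), 1))  # 33.3
```

Only the middle position matches in the example, so the result is one match out of three columns.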

My question is: do you know of any software or method that would help me solve this problem?

Thank you all in advance for reading this and for any answers.

clumpify from the BBMap suite will allow you to clump duplicate sequences together (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."). I can't recall any software that would let you do what you are asking for in an iterative way.

BTW: What are you trying to achieve? Perhaps there is a different way to do it.

Reason to do this is described in this SA question:

Hello filip7grudzien!

It appears that your post has been cross-posted to another site:

This is typically not recommended as it runs the risk of annoying people in both communities.

written 10 weeks ago by genomax

Thank you. I added a link to my post here on SeqAnswers so as not to duplicate content. I just wanted to raise the chance that somebody could help me; sorry for the inconvenience.

written 10 weeks ago by filip7grudzien
CD-HIT is designed for exactly this.
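To illustrate the general idea of threshold clustering on already-aligned sequences, here is a rough greedy sketch in Python. It is not CD-HIT's code and makes no claim to match its behavior; the names `identity` and `greedy_cluster` and the toy sequences are made up for illustration, and identity is computed column by column as defined in the question.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Percent of alignment columns where the two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal (aligned) length")
    if not a:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(aligned: Dict[str, str], threshold: float) -> List[List[str]]:
    """Assign each sequence to the first cluster whose representative it matches.

    The first member of each cluster acts as its representative. A sequence
    that matches no representative at >= threshold percent starts a new cluster.
    """
    clusters: List[List[str]] = []
    for name, seq in aligned.items():
        for cluster in clusters:
            representative = aligned[cluster[0]]
            if identity(seq, representative) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing representative was similar enough.
            clusters.append([name])
    return clusters


# Toy alignment (hypothetical data).
toy = {"s1": "AUGCA", "s2": "AUGCU", "s3": "GGGGG", "s4": "GGGGA"}
print(greedy_cluster(toy, 80.0))  # [['s1', 's2'], ['s3', 's4']]
```

Each sequence is compared only against cluster representatives rather than against every other sequence, which keeps the number of comparisons well below all-versus-all. The result depends on input order, so for alignments of this size a dedicated tool is likely the more practical choice.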
