Question

Cluster RNA sequences from fasta alignment by identity threshold

0


7.2 years ago

filip7grudzien • 0

Hello everyone,

I have large alignments of RNA sequences (16-200 thousand sequences) and I need to cluster them by an identity threshold. Basically, what I want to do is:

- compute the distribution of pairwise sequence identity for these alignments;
- once I know this distribution, cluster the sequences by an identity threshold, for example by creating files that contain the sequences from my current alignment that are at least 50% identical, at least 60% identical, and so on.

Just to clarify, I define the identity of two sequences as the fraction of aligned positions at which their nucleotides are identical, for example:

seq1 ATA
seq2 GTG

seq1 and seq2 are 33.3% identical.
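The definition above can be sketched in a few lines of Python. This is only a minimal illustration, not a tool recommendation: the function names `percent_identity` and `identity_histogram` are made up for this example, the sequences are assumed to be already aligned (equal length), and treating gap columns (`-`) as mismatches is an assumption that the post does not specify.

```python
from collections import Counter
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent of alignment columns where both sequences have the same nucleotide.

    Assumption: columns containing a gap ('-') never count as a match,
    and the denominator is the full alignment length.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(
        1 for x, y in zip(a.upper(), b.upper()) if x == y and x != "-"
    )
    return 100.0 * matches / len(a)


def identity_histogram(seqs, bin_width=10):
    """Count all-vs-all pairwise identities in bins of `bin_width` percent.

    Keys are the lower edge of each bin; 100% falls into the last bin.
    """
    counts = Counter()
    for a, b in combinations(seqs, 2):
        pid = percent_identity(a, b)
        bin_start = min(int(pid // bin_width) * bin_width, 100 - bin_width)
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


print(round(percent_identity("ATA", "GTG"), 1))  # 33.3
```

Note that the all-vs-all loop grows quadratically with the number of sequences, so for alignments with tens or hundreds of thousands of sequences it would probably only be practical on a random subsample.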

My question is: do you know of any software or method that would help me solve this problem?

Thank you all in advance for reading this and for any answers.

sequences identity RNA clustering RNA alignment • 1.7k views

ADD COMMENT • link updated 7.0 years ago by Biostar 20 • written 7.2 years ago by filip7grudzien • 0

0


clumpify.sh from the BBMap suite will let you clump duplicate sequences together (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."). I don't recall any software that would let you do what you are asking for in an iterative way.

BTW: What are you trying to achieve? Perhaps there is a different way to do it.

The reason for doing this is described in this SeqAnswers question: http://seqanswers.com/forums/showthread.php?t=82133

ADD REPLY • link 7.2 years ago by GenoMax 152k

0


Hello filip7grudzien!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=82132

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link 7.2 years ago by GenoMax 152k

0


Thank you. I added a link to this post on SeqAnswers so as not to duplicate content. I just wanted to increase the chance that somebody could help me; sorry for the inconvenience.

ADD REPLY • link 7.2 years ago by filip7grudzien • 0

score 0 · Answer 1 · 2018-06-24

0


7.0 years ago

Joe 22k

CD-HIT is designed for exactly this.
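For illustration only, the general idea of clustering against representatives at a fixed identity threshold can be sketched in Python as below. This is not CD-HIT's implementation or API: it works directly on sequences that are already aligned, it uses the positional identity definition from the question, and the names `percent_identity` and `greedy_cluster` are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of alignment columns with the same nucleotide (gaps never match)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(
        1 for x, y in zip(a.upper(), b.upper()) if x == y and x != "-"
    )
    return 100.0 * matches / len(a)


def greedy_cluster(records, threshold):
    """Greedy representative-based clustering.

    records:   list of (name, aligned_sequence) tuples
    threshold: minimum percent identity to a cluster's representative

    Each sequence joins the first cluster whose representative it matches
    at >= threshold; otherwise it starts a new cluster and becomes its
    representative. Returns a list of clusters (lists of names), with the
    representative first in each.
    """
    clusters = []  # each entry: (representative_sequence, [member names])
    for name, seq in records:
        for rep_seq, members in clusters:
            if percent_identity(seq, rep_seq) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((seq, [name]))
    return [members for _, members in clusters]


records = [("s1", "ATAG"), ("s2", "ATAC"), ("s3", "GCGC")]
for threshold in (50, 75, 80):
    print(threshold, greedy_cluster(records, threshold))
```

Running it once per threshold (50, 60, 70, ...) gives one set of clusters per cutoff. The result depends on the input order, and members of a cluster are only guaranteed to be similar to the representative, not to each other.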

ADD COMMENT • link 7.0 years ago by Joe 22k