Question: Cluster RNA sequences from fasta alignment by identity threshold
11 weeks ago, filip7grudzien wrote:

Hello everyone,

I have large alignments of RNA sequences (16-200 thousand sequences) and I need to cluster them by an identity threshold. Basically, what I want to do is:
- compute the distribution of pairwise sequence identity for these alignments;
- once I know this distribution, cluster the sequences by an identity threshold, for example by creating files that contain the sequences from my current alignment that are at least 50% identical, at least 60% identical, and so on.

Just to clarify, I define the identity of two sequences as the fraction of alignment positions at which their nucleotides are identical, for example:

seq1 ATA
seq2 GTG

seq1 and seq2 are 33.3% identical.
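The definition above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `identity` and `sampled_identity_histogram` are made up for this example, the sequences are assumed to come from the same alignment (equal length), and gap characters are compared like any other character, which the question does not specify. Because all-against-all comparison of 200 thousand sequences is very expensive, the second function estimates the identity distribution from randomly sampled pairs instead.

```python
import random
from collections import Counter


def match_count(a, b):
    """Number of alignment columns where both sequences carry the same character."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x == y for x, y in zip(a.upper(), b.upper()))


def identity(a, b):
    """Fraction of identical columns (assumption: gaps are treated as ordinary characters)."""
    return match_count(a, b) / len(a) if a else 0.0


def sampled_identity_histogram(seqs, n_pairs=10000, n_bins=10, seed=0):
    """Estimate the identity distribution from random pairs of a list of aligned sequences.

    Returns a Counter mapping bin index -> count, where bin i covers
    identities in [i / n_bins, (i + 1) / n_bins); 100% falls in the last bin.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_pairs):
        a, b = rng.sample(seqs, 2)
        # Integer arithmetic avoids floating-point errors at bin edges.
        bin_index = min(match_count(a, b) * n_bins // len(a), n_bins - 1)
        counts[bin_index] += 1
    return counts


print(round(identity("ATA", "GTG") * 100, 1))  # 33.3
```

The histogram gives a rough idea of which thresholds (50%, 60%, ...) are worth trying before running a full clustering.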

My question is: do you know of any software or method that would help me solve this problem?

Thank you all for reading, and thanks in advance for any answers.

modified 21 days ago by Biostar • written 11 weeks ago by filip7grudzien

clumpify.sh from the BBMap suite will let you clump duplicate sequences together (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."). I don't recall any software that would let you do what you are asking for in an iterative way.

BTW: What are you trying to achieve? Perhaps there is a different way to do it.

The reason for doing this is described in this SeqAnswers question: http://seqanswers.com/forums/showthread.php?t=82133

modified 10 weeks ago • written 10 weeks ago by genomax

Hello filip7grudzien!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=82132

This is typically not recommended as it runs the risk of annoying people in both communities.

written 10 weeks ago by genomax

Thank you. I added a link to this post on SeqAnswers so as not to duplicate content. I just wanted to increase the chance that somebody could help me; sorry for the inconvenience.

written 10 weeks ago by filip7grudzien
21 days ago, jrj.healey (United Kingdom) wrote:

CD-HIT is designed for exactly this.
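CD-HIT itself is a command-line program (its nucleotide variant is cd-hit-est), so the Python below is not CD-HIT. It is a minimal sketch of the greedy incremental idea that CD-HIT is built around: each sequence joins the first cluster whose representative it matches at or above the threshold, otherwise it starts a new cluster. Several things here are assumptions made for illustration: the function names, the use of the column-wise identity from the question instead of CD-HIT's own identity calculation, and processing sequences in input order (CD-HIT processes the longest sequences first, which is irrelevant when all aligned sequences have the same length).

```python
def identity(a, b):
    """Fraction of alignment columns where two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a.upper(), b.upper())) / len(a)


def greedy_cluster(seqs, threshold):
    """Greedy incremental clustering of aligned sequences.

    seqs: dict mapping sequence name -> aligned sequence.
    threshold: minimum identity to a cluster representative, between 0 and 1.
    Returns a dict mapping representative name -> list of member names.
    """
    clusters = {}
    for name, seq in seqs.items():
        for rep in clusters:
            if identity(seq, seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


# Hypothetical toy alignment, not data from the question.
toy = {"s1": "AUGGCUAA", "s2": "AUGGCUAG", "s3": "CCCCGGGG"}
print(greedy_cluster(toy, 0.8))  # {'s1': ['s1', 's2'], 's3': ['s3']}
```

Members are only guaranteed to meet the threshold against their representative, not against each other, and the result depends on input order. Running the function once per threshold (0.5, 0.6, ...) and writing each cluster to its own file would give the per-threshold files described in the question.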

written 21 days ago by jrj.healey