Question: Cluster RNA sequences from fasta alignment by identity threshold
7 months ago, filip7grudzien (0) wrote:

Hello everyone,

I have large alignments of RNA sequences (16 to 200 thousand sequences each) and I need to cluster them by an identity threshold. Basically, what I want to do is:

- compute the distribution of pairwise sequence identity for these alignments;
- once I know this distribution, cluster the sequences by identity threshold, for example by creating files that contain the sequences from my current alignment that are at least 50% identical, at least 60% identical, and so on.

Just to clarify, I define the identity of two sequences as the fraction of aligned positions at which their nucleotides are identical, for example:

seq1 ATA
seq2 GTG

seq1 and seq2 are 33.3% identical.
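The identity definition above can be written down as a short Python sketch. The function name `percent_identity` and the gap handling are assumptions for illustration: the post does not say how gap columns should be treated, so here columns that are gaps in both sequences are skipped, and a gap opposite a nucleotide counts as a mismatch.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of compared alignment columns with the same nucleotide.

    Assumption (not stated in the post): columns that are gaps in both
    sequences are ignored; a gap opposite a nucleotide is a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap and y == gap:
            continue  # column carries no information for this pair
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# The example from the post: only the middle position matches.
print(round(percent_identity("ATA", "GTG"), 1))  # 33.3
```

Computing this for every pair gives the identity distribution, but the number of pairs grows quadratically, so for 200 thousand sequences a random sample of pairs is probably more practical than an all-against-all comparison.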

My question is: do you know of any software or method that would help me solve this problem?

Thank you all for reading this, and thanks in advance for any answers.

modified 5 months ago by Biostar ♦♦ 20 • written 7 months ago by filip7grudzien (0)

clumpify.sh from the BBMap suite will let you clump duplicate sequences together (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."). I don't recall any software that would let you do what you are asking for in an iterative way.

BTW: What are you trying to achieve? Perhaps there is a different way to do it.

The reason for doing this is described in this SeqAnswers question: http://seqanswers.com/forums/showthread.php?t=82133

modified 7 months ago • written 7 months ago by genomax (59k)

Hello filip7grudzien!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=82132

This is typically not recommended as it runs the risk of annoying people in both communities.

written 7 months ago by genomax (59k)

Thank you. I added a link to this post on SeqAnswers so as not to duplicate content. I just wanted to increase the chance that somebody could help me. Sorry for the inconvenience.

written 7 months ago by filip7grudzien (0)
5 months ago, jrj.healey (9.1k, United Kingdom) wrote:

CD-HIT is designed for exactly this.

written 5 months ago by jrj.healey (9.1k)
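To make the threshold clustering asked about in the question concrete, here is a minimal sketch of greedy incremental clustering, the general strategy that CD-HIT is known for. It is not CD-HIT's implementation: CD-HIT takes unaligned sequences and computes its own alignments, whereas this sketch assumes the sequences are already aligned and uses the column-wise identity defined in the question. The names `identity` and `greedy_cluster`, the gap handling, and the toy sequences are all hypothetical.

```python
from typing import Dict, List


def identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns (double-gap columns skipped)."""
    matches = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap and y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> List[List[str]]:
    """Greedy incremental clustering of aligned sequences.

    Sequences are visited from most to fewest non-gap residues. Each one
    joins the first cluster whose representative (the cluster's first
    member) is at least `threshold` percent identical to it; otherwise it
    starts a new cluster and becomes that cluster's representative.
    """
    order = sorted(seqs, key=lambda name: -sum(c != "-" for c in seqs[name]))
    clusters: List[List[str]] = []
    for name in order:
        for cluster in clusters:
            if identity(seqs[name], seqs[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Toy alignment, made up for illustration only.
aln = {"seq1": "AUAGC", "seq2": "AUAGG", "seq3": "GCGCU"}
for t in (60, 90):
    print(t, greedy_cluster(aln, t))
```

Running the function once per threshold (50, 60, and so on) gives one set of clusters per threshold, which could then be written out as separate files. Each sequence is compared only with cluster representatives, so the cost depends on how many clusters form, and members of one cluster are guaranteed to be similar to the representative but not necessarily to each other.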