multilabel clustering/classification based on sequence similarity
5.0 years ago
odoluca ▴ 20

I have been dealing with a new issue regarding clustering/classifying sequences.

I need an algorithm that can cluster a huge number of short sequences. To illustrate the problem, I have made some visuals and a small set of sequences.


From the image it should be clear that there are two conserved sequences, the blue and the orange. The green shows variations.

What I need is to label the sequences based on which conserved sequence they contain, so that there will be two labels: label A: 0, 1, 2, 5, 6, 7, 9, 10 and label B: 0, 3, 4, 5, 7, 8, 11. Note that a sequence can carry both labels (here 0, 5 and 7).

If I build a similarity matrix based on pairwise alignments, I get this matrix:


But I am not sure how to process the matrix further to label the sequences. Does anyone have any ideas?

sequence pairwise clustering classification • 1.3k views

I am not sure I get what you're trying to achieve. If you want to identify conserved sequences then the standard approach is to use multiple sequence alignments. If you already know the conserved sequences and want to find out if they are present in your sequences, you could simply assess similarity of each test sequence to each of the conserved sequences. A pairwise similarity matrix only tells you how closely related two sequences are but not what kind of motif/sequence they share.
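
For the second case (the conserved sequences are already known), here is a minimal sketch of scoring each test sequence against each conserved sequence and assigning every label that passes a cutoff. Everything concrete in it is an assumption for illustration: the motifs and test sequences are made up, the helper names `best_window_identity` and `label_sequences` are hypothetical, and the 0.8 identity cutoff is arbitrary. It only does ungapped matching, so real data with indels would likely need a proper pairwise aligner instead.

```python
def best_window_identity(seq: str, motif: str) -> float:
    """Best fraction of matching positions over all ungapped placements
    of `motif` inside `seq` (0.0 if the motif does not fit)."""
    m = len(motif)
    if m == 0 or len(seq) < m:
        return 0.0
    best = 0
    for start in range(len(seq) - m + 1):
        window = seq[start:start + m]
        matches = sum(a == b for a, b in zip(window, motif))
        best = max(best, matches)
    return best / m


def label_sequences(seqs, motifs, min_identity=0.8):
    """Return {label: [sequence indices]}; a sequence may get several labels."""
    labels = {name: [] for name in motifs}
    for idx, seq in enumerate(seqs):
        for name, motif in motifs.items():
            if best_window_identity(seq, motif) >= min_identity:
                labels[name].append(idx)
    return labels


# Made-up toy data, not the sequences from the question.
motifs = {"A": "ACGTACGT", "B": "TTGGCCAA"}
seqs = [
    "GGACGTACGTCCTTGGCCAAGG",  # contains both motifs
    "CCACGTACGTCC",            # contains motif A only
    "GATTGGCCAATC",            # contains motif B only
    "ACGAACGTTT",              # motif A with one mismatch (7/8 identity)
    "GGGGGGGGGGGG",            # contains neither
]

result = label_sequences(seqs, motifs)
print(result)
```

Because each sequence is tested against every motif independently, a sequence that contains both conserved regions ends up under both labels, which is the multilabel behaviour asked for in the question.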

