Multilabel clustering/classification based on sequence similarity
3.9 years ago
odoluca ▴ 20

I have been dealing with a new issue regarding clustering/classifying sequences.

I need an algorithm that can cluster a huge number of short sequences. To illustrate the problem, I have made some visuals and a small set of sequences.

fig1 http://uploads.im/zxAZC.png

From the image it should be clear that there are 2 conserved sequences, the blue and the orange. The green shows variations.

What I need is to label the sequences based on which conserved sequence they contain, so that there will be two labels, and a sequence can carry both: label A: 0, 1, 2, 5, 6, 7, 9, 10 and label B: 0, 3, 4, 5, 7, 8, 11.

If I build a similarity matrix based on pairwise alignments, I get this matrix:

fig2 http://uploads.im/1jsJx.png

But I am not sure how to process the matrix further to label the sequences. Does anyone have any idea?

sequence pairwise clustering classification • 1.1k views

I am not sure I get what you're trying to achieve. If you want to identify conserved sequences then the standard approach is to use multiple sequence alignments. If you already know the conserved sequences and want to find out if they are present in your sequences, you could simply assess similarity of each test sequence to each of the conserved sequences. A pairwise similarity matrix only tells you how closely related two sequences are but not what kind of motif/sequence they share.
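
If the conserved sequences are already known, the second option could be sketched roughly as below. This is only a minimal illustration, not a tuned tool: the function names (`local_alignment_score`, `label_sequences`), the scoring values, the 80% threshold, and the toy sequences are all assumptions made up for the example. It scores each conserved sequence against each test sequence with a plain Smith-Waterman local alignment and assigns every label whose score clears the threshold, so one sequence can receive both labels.

```python
def local_alignment_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty, score only)."""
    prev = [0] * (len(target) + 1)  # previous row of the DP matrix
    best = 0
    for qc in query:
        curr = [0]
        for j, tc in enumerate(target, start=1):
            diag = prev[j - 1] + (match if qc == tc else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def label_sequences(sequences, conserved, min_fraction=0.8, match=2):
    """Give each sequence every label whose conserved sequence aligns well enough.

    min_fraction is an assumed cutoff: the fraction of the perfect-match score
    (match * len(conserved sequence)) that a hit must reach.
    """
    labels = {}
    for name, seq in sequences.items():
        hits = set()
        for label, motif in conserved.items():
            perfect = match * len(motif)
            if local_alignment_score(motif, seq, match=match) >= min_fraction * perfect:
                hits.add(label)
        labels[name] = hits
    return labels


# Toy data, invented for illustration only.
conserved = {"A": "ACGTACGTAC", "B": "TTGGCCAATT"}
sequences = {
    "s0": "TTACGTACGTACGGTTGGCCAATTAA",  # contains both conserved sequences
    "s1": "GGGACGTACCTACGGG",            # conserved sequence A with one mismatch
    "s2": "CCCTTGGCCAATTCCC",            # contains conserved sequence B
    "s3": "GAGAGAGAGAGAGAGA",            # contains neither
}

labels = label_sequences(sequences, conserved)
for name in sorted(labels):
    print(name, sorted(labels[name]))
```

For a really large number of sequences, a pure-Python alignment like this would likely be too slow, and an established alignment or motif-search tool would be the more practical choice; the labelling logic stays the same.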
