I have been dealing with a new issue regarding clustering/classifying sequences.
I need an algorithm that can cluster a huge number of short sequences. To illustrate the problem, I have made some visuals and a small set of sequences.
From the image it should be clear that there are two conserved sequences, the blue and the orange. The green shows variations.
What I need is to label each sequence by which conserved sequence it contains, so that there will be two labels: label A: 0, 1, 2, 5, 6, 7, 9, 10 and label B: 0, 3, 4, 5, 7, 8, 11. Sequences 0, 5 and 7 contain both conserved sequences, so they carry both labels.
If I build a similarity matrix based on pairwise alignment, I get this matrix:
But I am not sure how to process the matrix further to label the sequences. Does anyone have any idea?
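
For concreteness, here is a minimal sketch of the kind of setup described above. Everything in it is an assumption made for illustration: the toy sequences are hypothetical (not the real data), "AAAAAA" and "CCCCCC" merely stand in for the two conserved sequences, and the length of the longest shared substring (via Python's standard `difflib`) is used as a crude stand-in for a real pairwise alignment score.

```python
from difflib import SequenceMatcher

# Hypothetical toy data: "AAAAAA" and "CCCCCC" play the role of the two
# conserved sequences; the surrounding letters are variable filler.
sequences = [
    "GTAAAAAATGCCCCCCTG",  # contains both conserved sequences
    "TGAAAAAAGT",          # contains only the first
    "AAAAAATTG",           # contains only the first
    "GTCCCCCCT",           # contains only the second
]

def longest_shared_run(a, b):
    """Length of the longest substring common to a and b.

    This is only a rough proxy for a local alignment score.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size

def similarity_matrix(seqs):
    """Build the full pairwise similarity matrix as a list of lists."""
    n = len(seqs)
    return [[longest_shared_run(seqs[i], seqs[j]) for j in range(n)]
            for i in range(n)]

sim = similarity_matrix(sequences)
for row in sim:
    print(row)
```

In this toy matrix, sequence 0 scores high against both groups, which is exactly the situation where a hard partition into two clusters does not fit and an overlapping labeling is needed.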