Clustering sequences by similarity using a percentage identity matrix
6.0 years ago

Hi All,

I have a set of 400 nucleotide sequences that I want to cluster on the basis of similarity. For clustering, I am expecting similarity <=45% among members of a cluster. Also, there will be a few sequences that do not show similarity to any other members. Is there a clustering approach that allows us to set a cut-off for the relation (similarity) between members, and that can keep the members with very low similarity in an "unclustered" set?

I have generated a percentage identity matrix (400 x 400) using Clustal-Omega, but clustering on this matrix with the "affinity-propagation" approach is not giving good results.

P.S. I have already tried "cd-hit" and "uclust", but they are not recommended for cases where the expected sequence similarity is below 70%.

alignment r sequence • 6.1k views
6.0 years ago
alolex ▴ 910

Have you tried simple hierarchical clustering? You can do this in R with the hclust() and cutree() functions on your 400x400 matrix, and heatmap.2() (from the gplots package) for visualization. The more clusters you specify with cutree(), the more outliers you will get that have low similarity to the other sequences.

For example, where ident_mtx is the 400x400 matrix, you could try:

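library(gplots)  # provides heatmap.2()
# Turn percent identity into a distance (assumes a 0-100 scale), so similar sequences end up close
dist_mtx <- as.dist(100 - ident_mtx)
hc <- hclust(dist_mtx, method="ward.D2")
mycut <- cutree(hc, k=10)  # cluster membership for each sequence
heatmap.2(ident_mtx, Rowv=as.dendrogram(hc), Colv=as.dendrogram(hc))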
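
If what you want is a hard similarity cut-off rather than a fixed number of clusters, one option is to cut the tree by height instead of by k. What follows is only a minimal sketch under some assumptions: the identities are on a 0-100 scale, a small made-up matrix stands in for the real ident_mtx, and min_identity is a hypothetical name for the cut-off. It uses complete linkage (not ward.D2), because with complete linkage every pair of sequences inside a cluster cut at height h is at most h apart. Clusters of size one are relabelled "unclustered".

```r
# Toy stand-in for the 400x400 matrix: percent identity, 0-100 scale (made-up values)
ids <- c("seqA", "seqB", "seqC", "seqD", "seqE")
ident_mtx <- matrix(c(
  100,  80,  70,  20,  10,
   80, 100,  60,  25,  15,
   70,  60, 100,  30,  20,
   20,  25,  30, 100,  30,
   10,  15,  20,  30, 100
), nrow = 5, byrow = TRUE, dimnames = list(ids, ids))

min_identity <- 45  # hypothetical cut-off: minimum percent identity within a cluster

# Similarity -> distance, then complete-linkage hierarchical clustering
hc <- hclust(as.dist(100 - ident_mtx), method = "complete")

# Cut by height: no within-cluster distance exceeds 100 - min_identity
membership <- cutree(hc, h = 100 - min_identity)

# Relabel clusters that contain a single sequence as "unclustered"
cluster_sizes <- table(membership)
singletons <- as.integer(names(cluster_sizes)[cluster_sizes == 1])
labels <- ifelse(membership %in% singletons,
                 "unclustered",
                 paste0("cluster", membership))
names(labels) <- names(membership)
print(labels)
```

Raising min_identity gives smaller, tighter clusters and more "unclustered" sequences; lowering it does the opposite.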
6.0 years ago

You are essentially using percent identity as the similarity measure (or, once inverted, the distance measure) for clustering. You can use any clustering method (k-means, hierarchical, MCL, DBSCAN...) available out there.

MCL is pretty easy to use. MCL also likes undirected graphs, which you have.

All you really have to do is convert your percent identities into a tab-delimited format (label, label, weight) like this:

seqA seqB 0.4
seqB seqC 0.5
...
..
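
One way to produce such a file from the identity matrix in R is sketched below. It is only an illustration: the small matrix is made up, the 0-100 scale is an assumption, and min_weight is a hypothetical, optional filter for dropping very weak edges before clustering.

```r
# Toy stand-in for the identity matrix: percent identity, 0-100 scale (made-up values)
ids <- c("seqA", "seqB", "seqC")
ident_mtx <- matrix(c(
  100,  40,  10,
   40, 100,  50,
   10,  50, 100
), nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

# Keep each unordered pair once: upper triangle only, diagonal excluded
idx <- which(upper.tri(ident_mtx), arr.ind = TRUE)
edges <- data.frame(
  from   = rownames(ident_mtx)[idx[, "row"]],
  to     = colnames(ident_mtx)[idx[, "col"]],
  weight = ident_mtx[idx] / 100,  # rescale to 0-1 as in the example lines
  stringsAsFactors = FALSE
)

# Optional (hypothetical) filter: drop pairs below a minimum weight
min_weight <- 0.3
edges <- edges[edges$weight >= min_weight, ]

# Write label<TAB>label<TAB>weight, with no header and no quoting
write.table(edges, file = "input.data", sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
```

With these toy values the file contains the seqA/seqB pair at 0.4 and the seqB/seqC pair at 0.5; the seqA/seqC pair (0.1) is filtered out.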

And run MCL on it like this:

mcl input.data --abc -o output.clusters

It outputs a file where each line is a cluster.
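
To get back to an "unclustered" set, you could read that file into R and treat clusters with a single member as unclustered. This is a rough sketch that assumes the members on each line are tab-separated; the file contents below are made up and written to a temporary file in place of output.clusters.

```r
# Stand-in for output.clusters: one cluster per line, tab-separated members (made-up)
cluster_file <- tempfile()
writeLines(c("seqA\tseqB\tseqC", "seqD\tseqE", "seqF"), cluster_file)

# Split each line into its member sequences
members <- strsplit(readLines(cluster_file), "\t", fixed = TRUE)
sizes <- lengths(members)

# One row per sequence, with the index of the line (cluster) it came from
assignment <- data.frame(
  sequence = unlist(members),
  cluster  = rep(seq_along(members), sizes),
  stringsAsFactors = FALSE
)

# Clusters of size one are reported as "unclustered"
assignment$label <- ifelse(sizes[assignment$cluster] == 1,
                           "unclustered",
                           paste0("cluster", assignment$cluster))
print(assignment)
```

Note that a sequence with no edges at all in the input file will not show up in the output, so it is worth comparing the result against your full list of sequence names.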

You should maybe think about whether just using percent identity is a good way to cluster your sequences though.


@Damian - Thanks for the suggestion. What would you suggest instead of percent identity? The sequences are not protein-coding genes. Would using the distance matrix from Clustal-Omega be better?