Question: Clustering biological sequences based on numeric values
keshavmot20 wrote:

I am trying to cluster several amino acid sequences of a fixed length (13) into K clusters based on the Atchley factors (5 numbers that represent each amino acid).

For example, I have an input vector of strings like the following:

```r
key <- HDMD::AAMetric.Atchley

# 10,000 random 13-mers drawn from the amino acids listed in `key`
sequences <- sapply(1:10000, function(x)
  paste(sample(rownames(key), 13, replace = TRUE), collapse = ""))
```

However, my actual list contains over 10^5 sequences, so computational efficiency matters.

I then convert these sequences into numeric vectors by the following:

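```r
# Look up the 5 Atchley factors for every residue (one row per residue)
m1 <- key[strsplit(paste(sequences, collapse = ""), "")[[1]], ]
p <- 13  # sequence length
# Bind the 5 factor columns of each position side by side: N x 65 matrix
output <- do.call(cbind, lapply(1:p, function(i)
  m1[seq(i, nrow(m1), by = p), ]))
```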
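As a sanity check on the reshaping step, here is a minimal sketch on a toy example. The names `toy_key` and `toy_seqs` are made up for illustration (a 3-letter, 2-factor key standing in for `HDMD::AAMetric.Atchley`), and the row-wise reshape is just one possible way to get one row per sequence with the factors grouped by position.

```r
# Hypothetical 3-letter, 2-factor key standing in for HDMD::AAMetric.Atchley
toy_key <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3,
                  dimnames = list(c("A", "C", "D"), c("f1", "f2")))
toy_seqs <- c("ACD", "DDA")
p <- 3  # sequence length in this toy example

# One row of factors per residue, in sequence order
residues <- unlist(strsplit(toy_seqs, ""))
m <- toy_key[residues, , drop = FALSE]

# Row-wise reshape: each sequence becomes one row of p * ncol(toy_key) values,
# with the factors of position 1 first, then position 2, and so on
encoded <- matrix(as.vector(t(m)), nrow = length(toy_seqs), byrow = TRUE)
encoded
```

Each row of `encoded` should list the factors of residue 1, then residue 2, then residue 3 of the corresponding sequence, which is the same column layout the `cbind` approach produces.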

I want to cluster `output` (which now holds 65-dimensional vectors) in an efficient way.

I was originally using mini-batch k-means, but I noticed the results were very inconsistent between repeated runs. I need a consistent clustering approach.
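One common source of run-to-run variation in k-means is the random initialisation. The sketch below is not the mini-batch variant mentioned above; it uses base R's `kmeans` on simulated data (a stand-in for `output`) and shows that fixing the seed and using several random starts via `nstart` gives a repeatable partition. The helper `run_kmeans`, the seed values, and the choice of 3 clusters are illustrative assumptions.

```r
set.seed(42)  # seed for generating the simulated data only

# Stand-in for `output`: 300 points in 65 dimensions around 3 centres
centres <- matrix(rnorm(3 * 65, sd = 5), nrow = 3)
x <- centres[rep(1:3, each = 100), ] + matrix(rnorm(300 * 65), nrow = 300)

# Hypothetical helper: fix the RNG seed, then keep the best of 25 random starts
run_kmeans <- function(data, k, seed) {
  set.seed(seed)
  kmeans(data, centers = k, nstart = 25, iter.max = 50)
}

fit1 <- run_kmeans(x, k = 3, seed = 1)
fit2 <- run_kmeans(x, k = 3, seed = 1)

# Same seed and same data give the same cluster assignment
identical(fit1$cluster, fit2$cluster)
```

If results still differ strongly across different seeds even with a large `nstart`, that may indicate the data has no well-separated clusters for the chosen K rather than a problem with the algorithm.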

I am also concerned about the curse of dimensionality: at 65 dimensions, Euclidean distance may no longer work well.

Many high-dimensional clustering algorithms I saw assume that outliers and noise exist in the data, but as these are biological sequences converted to numeric values, there is no noise and there are no outliers.

In addition, feature selection will not work, as every property of every amino acid, and every amino acid itself, is relevant in the biological context.

How would you recommend clustering these vectors?

modified 2.9 years ago by Jean-Karim Heriche24k • written 2.9 years ago by keshavmot20

Can you not prioritise certain dimensions and then just focus on those in pairwise plots?

Alternatively, you could attempt to 'summarise' each dimension into a single vector of eigenvalues and, through that, merge the 65 dimensions into a single data matrix (I have done this in the past). You may also take inspiration from t-SNE and other high dimensional mass cytometry data processing algorithms. See Algorithmic Tools for Mining High-Dimensional Cytometry Data.

Jean-Karim Heriche24k wrote:

Clustering algorithms don't necessarily have a notion of noise, and even when they do, they actually perform better when there is none in the data. The problem is not so much the noise as the separability of the clusters in the feature space. Typically this is dealt with by finding the right feature space, either by using features that are relevant for the task or by using a relevant measure of similarity/distance, and by applying a suitable clustering algorithm. For example, k-means is most suitable for spherical clusters that are well separated. If, in the original feature space, the clusters have non-linearly separable shapes, a transformation like PCA may make them linearly separable.

One issue here is that the 65 dimensions correspond, in groups of 5, to positions in the sequence, and as such their order in the vector matters. For most distance/similarity measures, however, the order of dimensions is irrelevant. If you want to cluster sequences while preserving information on the ordering within the sequence, then you need to use appropriate methods. For example, here you could use a distance based on dynamic time warping (R package dtw).

I don't know these Atchley factors, but another potential issue could be that the 5 values per amino acid may be correlated.
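Two of the points above, checking whether the 5 factors are correlated and transforming the feature space with PCA before k-means, can be sketched in base R as follows. This is only an illustration on random data standing in for `output`; the 90% variance cutoff and the choice of 4 clusters are arbitrary assumptions, and the column layout (5 consecutive columns per position) is assumed to match the `cbind` construction in the question.

```r
set.seed(1)

# Stand-in for `output`: n sequences x 65 columns (13 positions x 5 factors)
n <- 200; p <- 13; f <- 5
x <- matrix(rnorm(n * p * f), nrow = n)

# Correlation between the 5 factors, pooling all positions:
# stack each position's block of 5 columns into one (n * p) x 5 matrix
per_residue <- do.call(rbind, lapply(seq_len(p), function(i)
  x[, ((i - 1) * f + 1):(i * f)]))
round(cor(per_residue), 2)

# PCA on the 65 columns, then k-means on the leading components
pca <- prcomp(x, center = TRUE, scale. = TRUE)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_pc <- which(var_explained >= 0.9)[1]  # arbitrary 90% cutoff
fit <- kmeans(pca$x[, seq_len(n_pc), drop = FALSE], centers = 4, nstart = 25)
table(fit$cluster)
```

On real data, large off-diagonal values in the correlation matrix would suggest the factors carry redundant information, and a small `n_pc` would suggest that far fewer than 65 dimensions are needed. Note that neither step takes the order of positions into account.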