Self-Learning Gene-Expression K-Means Clustering In R
12.8 years ago


I want to cluster gene expression data in R using kmeans (or some other function/package), and I would like the clustering to be 'intelligent', in the sense that some within-cluster dissimilarity metric is minimized while avoiding over-splitting of clusters.

I have already tried kmeans, but I do not want to specify the number of clusters a priori. Here is the code:

data.xpr = read.table("my_data.txt") # Rows = 250 genes, cols = 32 individuals
clusters = kmeans(x = data.xpr, centers=20)

I am aware that there are a few other questions on the subject, but the answers are very broad and none of them lets me do what I would like to accomplish.

I would very much appreciate some code examples for R.


r clustering code • 17k views
12.8 years ago

Of course there is more to it, but you will get more experience by reading about and trying out different methods on your data.

All clustering methods try to optimize some objective function based on dissimilarities, so that alone does not give you a good reason to decide for or against an algorithm. Some algorithms require you to give an estimate of the number of clusters present. Hierarchical clustering, on the other hand, does not. It is available in R through hclust() or the amap package. Visualizations such as heatmap() or heatmap.2() (from the gplots package) can also be useful. My tip: try hclust with Ward's inter-cluster distance, too.
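A minimal sketch of the hclust suggestion, using only base R. The simulated matrix stands in for the real expression data, and "ward.D2" is the name recent R versions use for Ward's criterion (older versions called it "ward"); both are assumptions for illustration.

```r
# Simulated stand-in for the expression matrix (rows = genes, cols = individuals).
set.seed(1)
data.xpr <- rbind(
  matrix(rnorm(50 * 32, mean = 0), nrow = 50),
  matrix(rnorm(50 * 32, mean = 3), nrow = 50)
)
rownames(data.xpr) <- paste0("gene", seq_len(nrow(data.xpr)))

# Euclidean distances between genes, then Ward linkage.
d  <- dist(data.xpr)
hc <- hclust(d, method = "ward.D2")

# The tree is built without choosing a number of clusters;
# it can be cut afterwards at whatever level looks sensible.
groups <- cutree(hc, k = 2)
table(groups)
```

Calling plot(hc) draws the dendrogram, which is usually the easiest way to judge where to cut.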

Model-based clustering is implemented in the mclust package (function Mclust()). I found it very useful for microarray (MA) data. It is "intelligent" in your sense in that it tries to guess optimal parameters by optimizing an information criterion (BIC). From the manual:

  • Model-based clustering (model and number of clusters selected via BIC).
  • Normal mixture modeling via EM for ten covariance structures.
  • Simulation from parameterized Gaussian mixtures.
  • Discriminant analysis via MclustDA.
  • Model-based hierarchical clustering for four covariance structures.
  • Displays, including uncertainty plots and random projections.
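As a rough sketch of how this looks in practice, assuming the mclust package is installed from CRAN. The simulated matrix and the range G = 1:9 are illustrative choices, not values from the question.

```r
library(mclust)  # assumption: installed from CRAN

# Simulated stand-in for the expression matrix (rows = genes, cols = samples).
set.seed(1)
data.xpr <- rbind(
  matrix(rnorm(50 * 8, mean = 0), nrow = 50),
  matrix(rnorm(50 * 8, mean = 4), nrow = 50)
)

# Fit Gaussian mixtures with 1 to 9 components; Mclust keeps the
# covariance structure and number of components with the best BIC.
fit <- Mclust(data.xpr, G = 1:9)

fit$G                     # number of clusters chosen by BIC
fit$modelName             # covariance structure chosen by BIC
head(fit$classification)  # cluster assignment per gene
```

summary(fit) and plot(fit, what = "BIC") give a fuller picture of how the candidate models compare.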

Recommendation: try many different methods:

  • PCA
  • Discriminant analysis if you have annotation data
  • Hierarchical clustering
  • Model-based clustering
  • Self-organizing maps (CRAN package som)

Some methods for assessing clusters were discussed in this question

Hope this gives you some hints on where to proceed.

12.7 years ago
D. Puthier ▴ 350


If you want to run the k-means partitioning algorithm on gene expression data, I think you would be better off using the Kmeans function from the amap package (available on CRAN). The default kmeans function uses Euclidean distance as its dissimilarity metric, which is probably not the right choice (though that may depend on your needs...). A better solution could be:

library(amap) # provides Kmeans() with a choice of distance measures
data.xpr = read.table("my_data.txt") # Rows = 250 genes, cols = 32 individuals
clusters = Kmeans(x = data.xpr, centers=20, method="pearson")

Alternatively, you can use the DBFMCL algorithm (on a Linux OS, as it requires an MCL installation). It is implemented in the RTools4TB BioC package. Note that it has to be run on unfiltered datasets, as it implements a density-based filtering step.

library(RTools4TB)
data.xpr <- read.table("my_data.txt") # The full dataset.
results  <- DBFMCL(data = data.xpr, distance.method = "pearson") # pass the data just read in
12.8 years ago

I think you're looking for some kind of "figure of merit" calculation. MeV has implemented this, and is a nice package for interactive clustering of gene expression data. For R, the clValid package might do what you need.

5.4 years ago

You could use my parallelised implementation of clusGap, which computes the gap statistic for a given dataset via PAM or k-means (or a custom metric): R functions edited for parallel processing
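For reference, the standard (non-parallel) clusGap from the cluster package can be called roughly as below. This is only a sketch on simulated data; K.max, B, nstart and the "Tibs2001SEmax" selection rule are illustrative choices, not recommendations from this thread.

```r
library(cluster)  # provides clusGap() and maxSE()

# Simulated stand-in for the expression matrix: three groups of genes.
set.seed(1)
data.xpr <- rbind(
  matrix(rnorm(40 * 10, mean = 0), nrow = 40),
  matrix(rnorm(40 * 10, mean = 4), nrow = 40),
  matrix(rnorm(40 * 10, mean = 8), nrow = 40)
)

# Gap statistic for k = 1..8 using k-means; B is the number of
# reference data sets simulated for each k, nstart is passed to kmeans.
gap <- clusGap(data.xpr, FUNcluster = kmeans, K.max = 8, B = 50, nstart = 10)

# Choose k from the gap values and their simulation standard errors.
k.best <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
k.best

# Final clustering with the selected number of clusters.
clusters <- kmeans(data.xpr, centers = k.best, nstart = 10)
```

plot(gap) shows the gap curve, which is worth inspecting before trusting the automatically selected k.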

12.8 years ago
Will 4.5k

While I'm not sure which R function there is, you're probably looking for "Chinese Restaurant Process" clustering.

I'm pretty sure there's an R library, but it's been a while since I've had a chance to look for it. I'm sure some googling would find it.


Will, he was looking for cluster analysis, not for a stochastic process. -1 for a "random-pick" Wikipedia link...

