Question: Self-Learning Gene-Expression K-Means Clustering In R
10.5 years ago, Eric Normandeau (Quebec, Canada) wrote:


I want to cluster gene expression data in R using kmeans (or some other function/package), and I would like the clustering to be 'intelligent', in the sense that some within-cluster dissimilarity metric is minimized while avoiding over-splitting of clusters.

I have already tried kmeans, but I do not want to specify the number of clusters a priori. Here is the code:

data.xpr = read.table("my_data.txt") # Rows = 250 genes, cols = 32 individuals
clusters = kmeans(x = data.xpr, centers=20)

I am aware that there are a few other questions on the subject, but the answers are very broad and none lets me do what I would like to accomplish.

I would very much appreciate some code examples for R.


10.5 years ago, Michael Dondrup (Bergen, Norway) wrote:

Of course there is more to it, but you will get more experience by reading about and trying out different methods on your data.

All clustering methods try to optimize some objective function based on dissimilarities, so that alone is not a good basis for choosing between algorithms. Some algorithms require an estimate of the number of clusters present. Hierarchical clustering, on the other hand, does not. It is available in R through hclust() or the amap package. Visualizations such as heatmap() or heatmap.2() (from the gplots package) can also be useful. My tip: try hclust with Ward's inter-cluster distance, too.
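As a minimal sketch of the hclust suggestion (not the answerer's own code): the data below are simulated stand-ins for the expression table, and the names expr, profiles and groups are made up for illustration. Note that the tree is built without a cluster count, but turning it into flat clusters still means choosing a cut.

```r
# Simulated stand-in for the expression table: 60 genes x 8 samples,
# three groups of genes with different mean profiles (hypothetical data).
set.seed(1)
profiles <- rbind(c(rep(2, 4), rep(-2, 4)),
                  c(rep(-2, 4), rep(2, 4)),
                  rep(0, 8))
expr <- profiles[rep(1:3, each = 20), ] +
  matrix(rnorm(60 * 8, sd = 0.5), nrow = 60)
rownames(expr) <- paste0("gene", 1:60)

# Euclidean distances between genes, then Ward's minimum-variance linkage.
d  <- dist(expr)
hc <- hclust(d, method = "ward.D2")

# Building the tree needs no cluster count. After inspecting plot(hc),
# cut it at a chosen number of groups (or at a height: cutree(hc, h = ...)).
groups <- cutree(hc, k = 3)
table(groups)
```

For real data, replace expr with as.matrix(data.xpr); plot(hc) shows the dendrogram used to pick the cut.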

Model-based clustering is implemented in the mclust package (function Mclust()). I found it very useful for microarray (MA) data. It is "intelligent" in your sense in that it tries to guess optimal parameters by optimizing an information criterion (BIC). From the manual:

  • Model-based clustering (model and number of clusters selected via BIC).
  • Normal mixture modeling via EM for ten covariance structures.
  • Simulation from parameterized Gaussian mixtures.
  • Discriminant analysis via MclustDA.
  • Model-based hierarchical clustering for four covariance structures.
  • Displays, including uncertainty plots and random projections.
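A rough sketch of how BIC-based selection with Mclust() might look, assuming the mclust package is installed from CRAN. The data are simulated and the object names (expr, fit) are illustrative only; with 32 samples per gene, as in the question, the fit may be slow or restricted to simpler covariance models.

```r
library(mclust)  # CRAN package; provides Mclust()

# Simulated stand-in data: 60 genes x 8 samples, three gene groups (hypothetical).
set.seed(1)
profiles <- rbind(c(rep(2, 4), rep(-2, 4)),
                  c(rep(-2, 4), rep(2, 4)),
                  rep(0, 8))
expr <- profiles[rep(1:3, each = 20), ] +
  matrix(rnorm(60 * 8, sd = 0.5), nrow = 60)

# Fit Gaussian mixtures with 1 to 9 components over the available
# covariance structures; the combination with the best BIC is kept.
fit <- Mclust(expr, G = 1:9)

fit$G                     # number of clusters selected by BIC
fit$modelName             # covariance structure selected by BIC
head(fit$classification)  # cluster label per gene
# plot(fit, what = "BIC") # BIC curves for every model and cluster count
```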

Recommendation: try many different methods:

  • PCA
  • Discriminant analysis if you have annotation data
  • hierarchical clustering
  • Model-based clustering
  • Self-organizing maps (CRAN package som)

Some methods for assessing clusters were discussed in this question

Hope this gives you some hints on where to proceed.

10.4 years ago, D. Puthier wrote:


If you want to run the k-means partitioning algorithm on gene expression data, I think you would do better to use the Kmeans function from the amap package (on CRAN). The default kmeans function uses Euclidean distance as its dissimilarity metric, which is probably not the right choice (though it may depend on your needs...). A better solution could be:

library(amap)  # provides Kmeans(), which supports correlation-based distances
data.xpr = read.table("my_data.txt") # Rows = 250 genes, cols = 32 individuals
clusters = Kmeans(x = data.xpr, centers=20, method="pearson")
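If amap is not available, a related trick in base R is to z-score each gene across samples before calling the default kmeans. For standardized rows, squared Euclidean distance equals 2(n - 1)(1 - r), where r is the Pearson correlation and n the number of samples, so Euclidean k-means then groups genes by profile shape. This is only a sketch on simulated data, and it approximates rather than reproduces what Kmeans(method = "pearson") does; all object names are illustrative.

```r
# Simulated stand-in data: 60 genes x 8 samples, three profile shapes (hypothetical).
set.seed(1)
profiles <- rbind(c(rep(2, 4), rep(-2, 4)),
                  c(rep(-2, 4), rep(2, 4)),
                  rep(c(2, -2), 4))
expr <- profiles[rep(1:3, each = 20), ] +
  matrix(rnorm(60 * 8, sd = 0.5), nrow = 60)

# Centre and scale each gene (row) to mean 0 and sd 1 across samples.
# Genes with zero variance would become NaN and should be removed first.
expr.z <- t(scale(t(expr)))

# Check the identity on one pair of genes.
n  <- ncol(expr.z)
r  <- cor(expr.z[1, ], expr.z[2, ])
d2 <- sum((expr.z[1, ] - expr.z[2, ])^2)
all.equal(d2, 2 * (n - 1) * (1 - r))

# Default (Euclidean) k-means on the standardized profiles.
clusters <- kmeans(expr.z, centers = 3, nstart = 25)
table(clusters$cluster)
```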

Alternatively, you can use the DBFMCL algorithm (on a Linux OS, as it requires an MCL installation). It is implemented in the RTools4TB BioC package. Note that it has to be run on unfiltered datasets, as it implements a density-based filtering step.

library(RTools4TB)
data.xpr <- read.table("my_data.txt") # The full, unfiltered dataset.
results  <- DBFMCL(data = data.xpr, distance.method = "pearson")
10.5 years ago, Michael Kuhn (EMBL Heidelberg) wrote:

I think you're looking for some kind of "figure of merit" calculation. MeV has implemented this, and is a nice package for interactive clustering of gene expression data. For R, the clValid package might do what you need.
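As an illustration of scoring candidate cluster counts in R, here is a sketch that uses the average silhouette width from the cluster package (which ships with R). This is a stand-in for the kind of internal validation measure such tools report. It is not the figure-of-merit calculation in MeV, and it does not use clValid's own interface. Data and object names are made up.

```r
library(cluster)  # ships with R; provides silhouette()

# Simulated stand-in data: 60 genes x 8 samples, three gene groups (hypothetical).
set.seed(1)
profiles <- rbind(c(rep(2, 4), rep(-2, 4)),
                  c(rep(-2, 4), rep(2, 4)),
                  rep(0, 8))
expr <- profiles[rep(1:3, each = 20), ] +
  matrix(rnorm(60 * 8, sd = 0.5), nrow = 60)

d  <- dist(expr)  # gene-to-gene Euclidean distances
ks <- 2:8         # candidate numbers of clusters

# Average silhouette width for each candidate k (higher is better).
avg.sil <- sapply(ks, function(k) {
  km <- kmeans(expr, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

best.k <- ks[which.max(avg.sil)]
data.frame(k = ks, avg.sil = round(avg.sil, 3))
best.k
```

The silhouette is undefined for a single cluster, so this approach cannot tell you that the data contain no cluster structure at all.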

3.1 years ago, Kevin Blighe (Republic of Ireland) wrote:

You could use my parallelised implementation of clusGap, which computes the gap statistic for a given dataset via PAM or k-means (or a custom metric); see "R functions edited for parallel processing".
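For reference, a sketch of the gap statistic using the standard, non-parallel clusGap() from the cluster package, not the parallelised version mentioned above. The data are simulated and the object names are illustrative.

```r
library(cluster)  # ships with R; provides clusGap() and maxSE()

# Simulated stand-in data: 60 genes x 8 samples, three gene groups (hypothetical).
set.seed(1)
profiles <- rbind(c(rep(2, 4), rep(-2, 4)),
                  c(rep(-2, 4), rep(2, 4)),
                  rep(0, 8))
expr <- profiles[rep(1:3, each = 20), ] +
  matrix(rnorm(60 * 8, sd = 0.5), nrow = 60)

# Gap statistic for k = 1..8 with k-means, using B = 50 reference data sets.
# Extra arguments (here nstart) are passed on to kmeans().
gap <- clusGap(expr, FUNcluster = kmeans, K.max = 8, B = 50,
               nstart = 25, verbose = FALSE)

# Pick k with the Tibshirani et al. (2001) standard-error rule.
best.k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
                method = "Tibs2001SEmax")
best.k
# plot(gap)  # gap curve with standard-error bars
```

Unlike the silhouette, the gap statistic is defined for k = 1, so it can also suggest that the data should not be split at all.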

10.5 years ago, Will (United States) wrote:

While I'm not sure which R function there is, you're probably looking for "Chinese Restaurant Process" clustering.

I'm pretty sure there's an R library, but it's been a while since I've had a chance to look for it. I'm sure some googling would find it.


Will, he was looking for cluster analysis, not a stochastic process; -1 for a "random-pick" Wikipedia link...

Reply written 10.5 years ago by Michael Dondrup