Question: Gene Expression: Clustering Co-Expressed Genes
Eric Normandeau (10k), Quebec, Canada, wrote 10.8 years ago:


Given a microarray experiment data set, how would you create clusters of co-expressed genes?

The data set can best be simplified as a matrix of G x I dimensions, where G is the number of genes and I the number of individuals. The aim is to cluster together genes whose expression is highly (read: significantly) correlated across all the individuals, ideally after removing the effect of factors responsible for experimental variation and biases.

The final result should be a statistically based grouping of genes into such clusters, from which the individual gene IDs can be recovered.

(NOTE: I am not looking for software that produces a visual clustering of the genes.)

Any ideas?

gene microarray clustering • 7.4k views
toni (2.2k) wrote 10.8 years ago:

The "Cluster Analysis" section of the book Bioinformatics and Computational Biology Solutions using R and Bioconductor describes HOPACH clustering. HOPACH stands for Hierarchical Ordered Partitioning and Collapsing Hybrid. As its name suggests, it is a hybrid of partitioning and hierarchical cluster analysis that recursively alternates splitting and collapsing steps based on a criterion called the median split silhouette (MSS). MSS aims to answer the question: should I split this (sub-)cluster again, or is it homogeneous enough? The silhouette of a gene measures how well the gene fits into its own cluster compared with the other clusters. Optimizing the MSS criterion (in HOPACH, a lower MSS indicates more homogeneous clusters) therefore tells you when to stop splitting, and the clusters left unsplit are "what you are looking for".
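To make the silhouette idea concrete, here is a minimal base-R sketch of the per-gene silhouette width. It is not the HOPACH algorithm and does not compute MSS; the helper name silhouette_widths, the toy matrix and the cluster labels are all made up for illustration, and Euclidean distance is an arbitrary choice.

```r
# Silhouette width of each gene, given a distance object and cluster labels.
# Hypothetical helper for illustration only; not part of the hopach package.
silhouette_widths <- function(d, cl) {
  d <- as.matrix(d)
  stopifnot(length(unique(cl)) >= 2)       # needs at least two clusters
  sapply(seq_len(nrow(d)), function(i) {
    own <- cl == cl[i]
    own[i] <- FALSE
    if (!any(own)) return(0)               # singleton cluster: width set to 0
    a <- mean(d[i, own])                   # mean distance to own cluster
    others <- setdiff(unique(cl), cl[i])
    b <- min(sapply(others, function(k) mean(d[i, cl == k])))  # nearest other cluster
    (b - a) / max(a, b)
  })
}

# Toy data (assumed): 6 genes x 4 individuals, two obvious groups
expr <- rbind(
  g1 = c(1.0, 2.0, 3.0, 4.0), g2 = c(1.1, 2.1, 2.9, 4.2), g3 = c(0.9, 1.8, 3.2, 3.9),
  g4 = c(4.0, 3.0, 2.0, 1.0), g5 = c(4.2, 2.9, 2.1, 1.1), g6 = c(3.9, 3.1, 1.9, 0.8)
)
cl <- c(1, 1, 1, 2, 2, 2)

sil <- silhouette_widths(dist(expr), cl)
names(sil) <- rownames(expr)
print(round(sil, 2))
```

Widths close to 1 mean a gene sits much closer to its own cluster than to any other; negative widths suggest the gene would fit better elsewhere.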

There is an R (Bioconductor) package available: hopach. Along with the package, you will also find a detailed manuscript (PDF) describing the methodology.

This package provides a bootstrap resampling function allowing one to obtain membership estimates for a gene in each cluster.

Here are some useful references associated with this package/method:

M. van der Laan and K. Pollard. Hybrid clustering of gene expression data with visualization and the bootstrap. Journal of Statistical Planning and Inference, 2003.

K. Pollard and M. van der Laan. A method to identify significant clusters in gene expression data. In SCI2002 Proceedings, volume II, pages 318-325, Orlando, 2002a. International Institute of Informatics and Systemics.

K. Pollard and M. van der Laan. Statistical inference for simultaneous clustering of gene expression data. Mathematical Biosciences, 176(1):99-121, 2002b.

This may be a starting point for getting what you want, that is, highly correlated or significantly co-expressed genes.

Hope it helps. I thought this package might be of interest.

(I do not personally have much experience with this package.)



Ian Simpson (950) wrote 10.8 years ago:

Finding co-regulated genes is actually quite a difficult task, and in my experience straightforward clustering such as k-means is not that productive on raw expression data. As I am sure you're aware, k-means is by far the most commonly used clustering method. Every clustering method has its problems, and the lack of associated statistics and of statistically informed decision making, for example when picking the number of clusters for the partitioning, is a big one.

Now to what I do. First, there is no harm in doing what has been proposed, but I bet you will find profiles that are segregated mainly on the basis of variations in expression magnitude rather than expression shape, and that is a big problem biologically. What (I believe) you are looking for is genes that share expression profiles, that is, shapes. I often show a simulated example in lectures of why the standard approach tends to co-cluster genes based on magnitude variation rather than similarity in shape. The way past this is actually very simple, and we have found it incredibly useful in identifying co-regulated genes in Drosophila PNS development.

Take your original expression matrix [genes x conditions] and unitise the row vectors, that is, make the matrix magnitude invariant. To do this, normalise each row by the length (Euclidean norm) of its vector:

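# For an expression matrix sim_class (genes in rows), get each row's length: sqrt(sum of squares)
norm_factors <- sqrt(rowSums(sim_class^2))

# Divide each row by its norm factor (R recycles the vector down the rows,
# so row i is divided by norm_factors[i]); all-zero rows would give NaN
normalised_sim_class <- sim_class / norm_factors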

Clustering this normalised expression matrix now pulls out genes that have the same shape irrespective of magnitude. Biologically, this means you are pulling together genes that might be direct targets of a particular set of transcription factors, with identical profiles but simply a different response scale. This is what you often find in reality: co-regulated genes more often respond on different scales but, in our experience, with similar expression profile shapes. I hope that's of some use, even if only to run alongside your current analyses to see the differences this approach produces.
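As a rough illustration of the magnitude-versus-shape point, the sketch below builds a tiny simulated matrix (two made-up profile shapes, each at a low and a high response scale) and runs k-means on the raw and on the unitised rows. The data, the noise level and the choice of two clusters are assumptions for the demo, not taken from the Drosophila analysis mentioned above.

```r
set.seed(1)

# Two hypothetical profile shapes across 6 conditions
up   <- c(1, 2, 3, 4, 5, 6)
peak <- c(1, 3, 6, 6, 3, 1)

# Each shape at a low and a high response scale, plus a little noise
sim_class <- rbind(
  up_low   = up,   up_high   = 10 * up,
  peak_low = peak, peak_high = 10 * peak
) + matrix(rnorm(24, sd = 0.05), nrow = 4)

# Unitise the rows as described above
norm_factors <- sqrt(rowSums(sim_class^2))
normalised_sim_class <- sim_class / norm_factors

# k-means with 2 clusters on raw and on unitised profiles
raw_clusters   <- kmeans(sim_class, centers = 2, nstart = 20)$cluster
shape_clusters <- kmeans(normalised_sim_class, centers = 2, nstart = 20)$cluster

print(rbind(raw = raw_clusters, unitised = shape_clusters))
```

On data like this, the raw clustering should pair the two low-scale genes and the two high-scale genes, while the unitised clustering should pair genes by shape.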

Modified 2.4 years ago by _r_am32k.

I'm curious how the clustering result from magnitude normalization (using Euclidean distance?) compares with clustering on the raw data using correlation...

Reply written 10.7 years ago by Hanif Khalak (1.2k)
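One way to look at this comparison: for unit-length rows, the squared Euclidean distance equals 2 * (1 - cosine similarity), and if each row is also mean-centred before unitising, the cosine becomes the Pearson correlation. The base-R sketch below checks both identities numerically on random data; the matrix x and the helper unitise are made up for illustration.

```r
set.seed(42)

# Hypothetical expression matrix: 5 genes x 8 conditions, all values around 5
x <- matrix(rnorm(5 * 8, mean = 5), nrow = 5,
            dimnames = list(paste0("gene", 1:5), NULL))

# Scale every row to unit Euclidean length
unitise <- function(m) m / sqrt(rowSums(m^2))

# (1) Unitised only: squared distance vs. 2 * (1 - cosine similarity)
u <- unitise(x)
d2_unit <- as.matrix(dist(u))^2
cosine  <- tcrossprod(u)
same_as_cosine <- isTRUE(all.equal(d2_unit, 2 * (1 - cosine),
                                   check.attributes = FALSE))

# (2) Mean-centred, then unitised: squared distance vs. 2 * (1 - Pearson r)
uc <- unitise(x - rowMeans(x))
d2_centred <- as.matrix(dist(uc))^2
pearson <- cor(t(x))
same_as_pearson <- isTRUE(all.equal(d2_centred, 2 * (1 - pearson),
                                    check.attributes = FALSE))

print(c(cosine = same_as_cosine, pearson = same_as_pearson))
```

So unitising alone corresponds to an uncentred (cosine) correlation. It matches clustering on 1 - Pearson correlation only if the rows are centred first, which may matter for all-positive intensity data.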
Paulo Nuin (3.7k) wrote 10.8 years ago:

Start with k-means, either with an arbitrary number of clusters or with some pre-defined number. You can also use a self-learning variant of k-means and from that determine the best number of clusters, together with some expression profiles. From there, you can try other, more advanced techniques to filter the data further.
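A minimal base-R sketch of this starting point, assuming simulated data with three groups: run kmeans() over a range of cluster numbers, inspect the total within-cluster sum of squares for an "elbow", then recover the gene IDs per cluster. The elbow heuristic is just one common choice and is not necessarily what is meant by self-learning k-means above; all data and names here are hypothetical.

```r
set.seed(7)

# Hypothetical data: 3 groups of 20 genes each, measured in 5 conditions
centres <- rbind(c(0, 0, 0, 0, 0), c(5, 5, 0, 0, 0), c(0, 0, 0, 5, 5))
expr <- centres[rep(1:3, each = 20), ] + matrix(rnorm(60 * 5, sd = 0.5), nrow = 60)
rownames(expr) <- paste0("gene", 1:60)

# Total within-cluster sum of squares for k = 1..6
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(expr, centers = k, nstart = 25)$tot.withinss)
print(round(wss, 1))   # look for the k after which the decrease levels off

# Fit the chosen k and list the gene IDs in each cluster
fit <- kmeans(expr, centers = 3, nstart = 25)
clusters <- split(names(fit$cluster), fit$cluster)
print(lengths(clusters))
```

The rows of fit$centers give the mean expression profile of each cluster, which can be plotted as a profile graph.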


You can do k-means clustering in MeV or R.

Reply written 10.8 years ago by Madelaine Gogol (5.2k)

@nuin Any specific recommendation about an approach, either software, R package...?

Reply written 10.8 years ago by Eric Normandeau (10k)

Yes, I would start with MeV; it might be simpler to use and more straightforward. Also, the profile graphs help a lot.

Reply written 10.8 years ago by Paulo Nuin (3.7k)
Istvan Albert ♦♦ (86k), University Park, USA, wrote 10.8 years ago:

Roughly speaking, clustering methods require you to define a metric that quantifies the similarity between two elements, and a linkage that determines how the similarities are combined for elements classified in the same cluster.
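As one possible instance of these two choices, the base-R sketch below uses 1 - Pearson correlation as the metric and average linkage in hclust(), then cuts the tree to list the gene IDs per cluster. The simulated data, the specific metric and linkage, and the cut at k = 2 are illustrative assumptions only.

```r
set.seed(3)

# Two hypothetical profile shapes over 10 individuals
base_a <- sin(seq(0, pi, length.out = 10))
base_b <- seq(1, -1, length.out = 10)

# 5 genes per shape, each with its own random scale plus a little noise
expr <- rbind(
  t(replicate(5, runif(1, 1, 10) * base_a + rnorm(10, sd = 0.05))),
  t(replicate(5, runif(1, 1, 10) * base_b + rnorm(10, sd = 0.05)))
)
rownames(expr) <- paste0("gene", 1:10)

# Metric: 1 - Pearson correlation between gene profiles
d <- as.dist(1 - cor(t(expr)))

# Linkage: average distance between the members of two clusters
tree <- hclust(d, method = "average")

# Cut the tree into 2 groups and recover the gene IDs in each
groups <- cutree(tree, k = 2)
print(split(names(groups), groups))
```

Cutting at a height instead (cutree(tree, h = ...)) would express the grouping as a correlation threshold rather than a fixed number of clusters.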
