Gene Expression: Clustering Co-Expressed Genes
13.0 years ago

Hi,

Given a microarray experiment data set, how would you create clusters of co-expressed genes?

The data set can best be simplified as a matrix of G x I dimensions, where G is the number of genes and I the number of individuals. The aim is to cluster together genes whose expression is highly (read: significantly) correlated across all the individuals, ideally after having removed the effect of factors responsible for experimental variation and biases.

The final result is a statistically based grouping of genes in such clusters from which the individual gene ID can be recovered.
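To make the "remove the effect of known factors first" step concrete, here is a minimal sketch in base R. It assumes the nuisance factor is known and recorded per individual (the `batch` vector below is hypothetical, as is the simulated matrix `expr`), and it simply takes per-gene linear-model residuals before computing gene-gene correlations. It is one possible approach, not a recommendation over dedicated normalisation tools.

```r
set.seed(1)

# Hypothetical data: 6 genes (rows) x 8 individuals (columns)
n_genes <- 6
n_ind <- 8
expr <- matrix(rnorm(n_genes * n_ind), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes),
                               paste0("ind", 1:n_ind)))

# Hypothetical known nuisance factor, with an artificial shift added to batch B
batch <- factor(rep(c("A", "B"), each = 4))
expr[, batch == "B"] <- expr[, batch == "B"] + 2

# lm() with a matrix response fits one model per column of t(expr), i.e. per gene
fit <- lm(t(expr) ~ batch)

# Residuals come back as individuals x genes; transpose to genes x individuals
adjusted <- t(residuals(fit))

# G x G matrix of gene-gene correlations on the adjusted data
gene_cor <- cor(t(adjusted))
```

The adjusted matrix keeps the gene IDs as row names, so any grouping computed from `gene_cor` can be mapped straight back to the genes.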

(NOTE: I am not looking for software that makes a visual clustering of the genes)

Any ideas?

13.0 years ago
toni ★ 2.2k

The "Cluster Analysis" section of the book Bioinformatics and Computational Biology Solutions Using R and Bioconductor describes HOPACH clustering. HOPACH stands for Hierarchical Ordered Partitioning and Collapsing Hybrid. As its name suggests, it is a hybrid of partitioning and hierarchical cluster analysis that recursively alternates splitting and collapsing steps, guided by a criterion called the median split silhouette (MSS). MSS aims to answer the question: should I split this (sub-)cluster again, or is it homogeneous enough? The silhouette of a gene measures how well the gene fits into its own cluster compared with the other clusters. MSS is a measure of cluster heterogeneity, so minimizing it tells you when splitting should stop; the clusters that remain unsplit are "what you are looking for".
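To give a feel for what a silhouette measures, here is a small sketch using the `cluster` package (a recommended package normally installed with R), not `hopach` itself. Note the swap: it uses the ordinary average silhouette width, which is maximized to pick the number of clusters, rather than HOPACH's median split silhouette. The simulated profiles and the candidate range of k are assumptions made for illustration.

```r
library(cluster)  # provides pam() and silhouette()
set.seed(42)

# Hypothetical data: two groups of 15 gene profiles over 10 conditions
shape1 <- sin(seq(0, pi, length.out = 10))
shape2 <- cos(seq(0, pi, length.out = 10))
x <- rbind(t(replicate(15, shape1 + rnorm(10, sd = 0.1))),
           t(replicate(15, shape2 + rnorm(10, sd = 0.1))))
rownames(x) <- paste0("gene", 1:30)

# Correlation-based dissimilarity between genes
d <- as.dist(1 - cor(t(x)))

# Average silhouette width for each candidate number of clusters
avg_sil <- sapply(2:5, function(k) pam(d, k)$silinfo$avg.width)
names(avg_sil) <- 2:5
best_k <- as.integer(names(which.max(avg_sil)))

# Per-gene silhouette widths for the chosen partition
fit <- pam(d, best_k)
sil <- silhouette(fit)

# Gene IDs per cluster
split(names(fit$clustering), fit$clustering)
```

Genes with a silhouette width near 1 sit comfortably in their cluster; values near 0 or below suggest the gene fits a neighbouring cluster about as well.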

There is an R (Bioconductor) package available: hopach. Along with the package, you will also find a detailed manuscript (PDF) describing the methodology.

This package provides a bootstrap resampling function allowing one to obtain membership estimates for a gene in each cluster.

Here are some useful references associated with this package/method:

Van der Laan and Pollard. Hybrid clustering of gene expression data with visualization and the bootstrap. 2003. Journal of Statistical Planning and Inference.

K.Pollard and M. van der Laan. A method to identify significant clusters in gene expression data. In SCI2002 Proceedings, volume II, pages 318-325, Orlando, 2002a. International Institute of Informatics and Systemics.

K.Pollard and M. van der Laan. Statistical inference for simultaneous clustering of gene expression data. Mathematical Biosciences, 176(1):99-121, 2002b.

This may be a starting point for getting what you want, that is to say, highly correlated or significantly co-expressed genes.

Hope it helps. I thought this package might be of interest.

(I personally do not have much experience with this package)

regards,

tony.

13.0 years ago
Ian Simpson ▴ 960

Finding co-regulated genes is actually quite a difficult task, and in my experience straight clustering such as k-means is not that productive on raw expression data. As I am sure you're aware, k-means is by far the most commonly used clustering method. Any single clustering method has its problems, and two big ones are the lack of associated statistics and the lack of statistically informed decision making when doing things like picking the number of clusters for the partitioning.

Now to what I do. Firstly, there is no harm in doing what has been proposed, but I bet you will find profiles that are segregated mainly on the basis of variations in expression magnitude rather than expression shape, and that is a big problem biologically. When you think about what (I believe) you are looking for, you are trying to find genes that share expression profiles, that is, shapes. In lectures I often show a simulated example of why the standard approach tends to co-cluster genes by magnitude variation rather than by similarity in shape. The way past this is actually very simple, and we have found it incredibly useful in identifying co-regulated genes in Drosophila PNS development.

Take your original expression matrix [genes x conditions] and unitise the vectors, that is, make the matrix magnitude invariant. The way to do this is to normalise each row by the length of its vector:

# For an expression matrix sim_class (genes in rows), get each row's
# Euclidean length: sqrt(sum of squares)
norm_factors <- sqrt(rowSums(sim_class^2))

# Divide each row by its length; the vector is recycled down the columns,
# so row i is divided by norm_factors[i]
normalised_sim_class <- sim_class / norm_factors


Clustering this normalised matrix now pulls out genes that have the same shape irrespective of magnitude. Biologically, this means you are pulling together genes that might be direct targets of a particular set of transcription factors, with identical profiles but simply a different response scale. This is what you often find in reality: co-regulated genes tend to respond on different scales but, in our experience, with similar expression profile shapes. I hope that's of some use, even if only to run alongside your current analyses to see the differences this approach produces.
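A small simulated illustration of the magnitude-versus-shape point, as a sketch only. The four toy genes, the noise level, and the choice of hierarchical clustering with `hclust()` and `cutree()` are assumptions made for the example, not the analysis described above.

```r
set.seed(7)

# Hypothetical toy data: two shapes, each at a low and a high magnitude
shape_up <- 1:6
shape_down <- 6:1
sim_class <- rbind(up_low = shape_up,
                   up_high = 10 * shape_up,
                   down_low = shape_down,
                   down_high = 10 * shape_down)
sim_class <- sim_class + rnorm(length(sim_class), sd = 0.05)

# Scale every row to unit Euclidean length
unitise <- function(m) m / sqrt(rowSums(m^2))

# Euclidean distance + hierarchical clustering, cut into two groups
raw_groups <- cutree(hclust(dist(sim_class)), k = 2)
unit_groups <- cutree(hclust(dist(unitise(sim_class))), k = 2)

# Gene IDs per cluster, before and after unitising
split(names(raw_groups), raw_groups)
split(names(unit_groups), unit_groups)
```

On the raw matrix the two low-magnitude genes end up together regardless of shape; after unitising, the grouping follows the up/down shape instead.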


I'm curious how the clustering result from magnitude normalization (using Euclidean distance?) compares with clustering on the raw data using correlation...
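One way to look at this question: for unit-length rows u and v, the squared Euclidean distance equals 2 * (1 - cos(u, v)), and if each row is also mean-centred before unitising, the cosine is exactly the Pearson correlation. So Euclidean distance on centred, unitised rows is a monotone function of correlation distance on the raw data. Without centring, the comparison is with cosine similarity rather than Pearson correlation. The sketch below checks the identity numerically on a hypothetical random matrix.

```r
set.seed(3)

# Hypothetical data: 5 genes x 12 conditions
x <- matrix(rnorm(5 * 12), nrow = 5,
            dimnames = list(paste0("gene", 1:5), NULL))

# Centre each row, then scale it to unit Euclidean length
centred <- x - rowMeans(x)
unit <- centred / sqrt(rowSums(centred^2))

# Euclidean distance between the centred, unitised rows
d_euclid <- as.matrix(dist(unit))

# sqrt(2 * (1 - r)) from Pearson correlation on the raw rows
# (pmax guards against tiny negative values from rounding)
d_cor <- sqrt(2 * pmax(1 - cor(t(x)), 0))

# TRUE if the two distance matrices agree up to numerical tolerance
isTRUE(all.equal(d_euclid, d_cor, check.attributes = FALSE))
```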

13.0 years ago
Paulo Nuin ★ 3.7k

Start with k-means, either with an arbitrary number of clusters or with some pre-defined number. You can also use a self-learning variant of k-means and use its expression profiles to determine the best number of clusters. From there you can try other, more advanced techniques to filter the data further.
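As a starting point in R, a minimal k-means sketch with base `kmeans()` might look like the following. The simulated profiles, the candidate range of k, and the use of the total within-cluster sum of squares (the "elbow" heuristic) to compare values of k are all assumptions for illustration; it is a heuristic, not a statistical test.

```r
set.seed(11)

# Hypothetical data: three groups of 20 gene profiles over 6 conditions
make_group <- function(shape, n) {
  t(replicate(n, shape + rnorm(length(shape), sd = 0.3)))
}
expr <- rbind(make_group(c(0, 1, 2, 3, 2, 1), 20),
              make_group(c(3, 2, 1, 0, 1, 2), 20),
              make_group(c(1, 1, 1, 1, 1, 1), 20))
rownames(expr) <- paste0("gene", seq_len(nrow(expr)))

# Total within-cluster sum of squares for k = 1..6; look for the "elbow"
wss <- sapply(1:6, function(k) {
  kmeans(expr, centers = k, nstart = 25)$tot.withinss
})

# Fit with the chosen k (3 here, matching how the data were simulated)
fit <- kmeans(expr, centers = 3, nstart = 25)

# Gene IDs per cluster
clusters <- split(rownames(expr), fit$cluster)
```

Using `nstart = 25` reruns the algorithm from several random starts and keeps the best solution, which makes the result less dependent on the initial centres.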


You can do k-means clustering in MeV or R.


@nuin Any specific recommendation about an approach, whether a piece of software or an R package?


Yes, I would start with MeV; it might be simpler to use and more straightforward. Also, the profile graphs help a lot.

13.0 years ago

Roughly speaking, clustering methods require you to define two things: a metric, which quantifies the similarity between two elements, and a linkage, which combines the pairwise similarities of the elements in each cluster into a similarity between clusters.
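In R these two choices map onto separate arguments, as in the sketch below. The toy matrix, the correlation-based metric (1 - Pearson r), average linkage, and the cut height of 0.5 are all assumptions chosen for illustration; other metrics and linkages plug into the same two places.

```r
set.seed(5)

# Hypothetical toy data: two shapes at different scales and offsets
up <- 1:8
down <- 8:1
expr <- rbind(g1 = up, g2 = 5 * up, g3 = up + 20,
              g4 = down, g5 = 3 * down, g6 = down + 20)
expr <- expr + rnorm(length(expr), sd = 0.2)

# Metric: correlation distance between genes (0 = perfectly correlated)
d <- as.dist(1 - cor(t(expr)))

# Linkage: average distance between members of two clusters
tree <- hclust(d, method = "average")

# Cut the tree at a fixed height to get cluster labels
groups <- cutree(tree, h = 0.5)

# Gene IDs per cluster
split(names(groups), groups)
```

Swapping `method = "average"` for `"complete"` or `"single"` changes only the linkage, while replacing `1 - cor(...)` with `dist(expr)` changes only the metric.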
