Question: Check For Co-Expressed Genes In Microarray Experiments
8.0 years ago by
elb190 wrote:

Hi, I have a matrix of n genes (rows) and m samples (columns) from a microarray gene expression experiment performed using Affymetrix technology. I have to check for co-expressed genes (up- or down-regulated genes), and in particular whether a subset of genes shows correlated expression. Then I have to visualize this information using, for example, a heat map. Searching the web, I found a plethora of methods for my purpose: Pearson correlation coefficient for correlation, k-means for cluster analysis, and so on. My question is: is there a method that outperforms the others for my purpose? I'm very confused about this.

Thanks a lot,


Modified 6.0 years ago by mjoyraj80 • written 8.0 years ago by elb190
8.0 years ago by
Edinburgh, UK
Ben2.0k wrote:

"Outperforms" is a vague term, especially when you haven't defined what you consider coexpressed genes. The Pearson correlation coefficient is widely used for measuring coexpression in (properly normalised) microarray expression data. The bigger problem is really defining the cutoff at which you consider two genes coexpressed. If you work genome-wide and have a large number of different experiments, a significance test such as R's cor.test will give you a huge number of gene pairs which appear significantly correlated (even after multiple testing correction). Many of these will have consistently low expression across all samples, so you may want to filter your dataset to leave only those probesets which show some level of variability across your samples. Instead of using significance testing, it may be more useful to tune your own cutoff value of r, depending on what it is you're really looking for.
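As a rough illustration of the filter-then-cutoff idea, here is a minimal Python sketch using only the standard library. The probeset names, the expression values, and both thresholds (MIN_VARIANCE, R_CUTOFF) are made up for the example; in practice you would tune them on your own data, and in R you would more likely use cor on the filtered matrix.

```python
import math
from statistics import pvariance

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical log2 expression values: probeset -> one value per sample.
expression = {
    "probe_A": [7.1, 8.3, 9.0, 10.2, 11.1],
    "probe_B": [6.9, 8.0, 9.2, 10.0, 11.3],
    "probe_C": [10.5, 9.4, 8.8, 7.9, 7.0],
    "probe_flat": [3.01, 3.00, 3.02, 3.01, 3.00],  # low, nearly constant signal
}

MIN_VARIANCE = 0.1  # assumed variability filter, not a recommended value
R_CUTOFF = 0.9      # assumed cutoff on r, not a recommended value

# Step 1: drop probesets that barely vary across samples.
variable = {p: v for p, v in expression.items() if pvariance(v) >= MIN_VARIANCE}

# Step 2: keep pairs whose r passes the cutoff.
# Use abs(r) instead if anti-correlated pairs are also of interest.
names = sorted(variable)
pairs = []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson(variable[a], variable[b])
        if r >= R_CUTOFF:
            pairs.append((a, b, r))

for a, b, r in pairs:
    print(f"{a}\t{b}\t{r:.3f}")
```

With this toy data the flat probeset is filtered out before any correlation is computed, and only the probe_A/probe_B pair passes the cutoff, since probe_C runs in the opposite direction.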

Importantly, the Pearson correlation coefficient (PCC) measures pairwise correlation, whereas a clustering method will report groups of genes with similar expression profiles. It is possible, though, to apply PCC on an all-vs-all basis to create a correlation matrix, use this to build a coexpression network, and then extract cliques or subgraphs of this network to find mutually coexpressed groups. Again, with this method you need to define a cutoff at which you deem the coexpression significant.
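A minimal sketch of that network route, in plain Python: compute the all-vs-all correlations, join genes whose r passes a cutoff, and read off the connected subgraphs. The gene names, profiles, and the R_CUTOFF value are invented for the example, and connected components are used here as the simplest kind of subgraph; requiring full cliques would be stricter.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical expression profiles across five samples.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [1.2, 1.9, 3.1, 4.2, 4.8],
    "g3": [2.0, 4.1, 5.9, 8.2, 9.9],
    "g4": [5.0, 1.0, 4.0, 2.0, 3.0],
    "g5": [4.8, 1.3, 4.1, 1.8, 3.2],
    "g6": [3.0, 3.1, 2.9, 5.0, 1.0],
}
R_CUTOFF = 0.9  # assumed cutoff, tune for your data

# All-vs-all correlation matrix, stored as a dict keyed by gene pair.
corr = {(a, b): pearson(profiles[a], profiles[b])
        for a, b in combinations(profiles, 2)}

# Network: an edge joins two genes whose r passes the cutoff.
neighbours = {g: set() for g in profiles}
for (a, b), r in corr.items():
    if r >= R_CUTOFF:
        neighbours[a].add(b)
        neighbours[b].add(a)

def connected_components(graph):
    """Return the connected subgraphs as sorted lists of node names."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        components.append(sorted(members))
    return components

# Groups of at least two genes linked by above-cutoff correlations.
groups = [c for c in connected_components(neighbours) if len(c) > 1]
print(groups)
```

On this toy data the result is two groups (g1, g2, g3 and g4, g5), with g6 left on its own. Changing R_CUTOFF changes the groups, which is exactly the cutoff problem described above.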

It's straightforward to produce a heatmap with clustering (see this great tutorial) but this will likely only be meaningful / useful for some subset or with regards to a priori groups of genes.

Finally, depending on what organism you're working on there may already be well-described databases of coexpressed genes, e.g. for Arabidopsis there's ACT and CressExpress, for others there's COXPRESdb and probably lots more. Even if they don't give the data you want, they're probably useful to validate your results.


Hi Ben!!! Thanks a lot for your answer!

Reply written 8.0 years ago by elb190
8.0 years ago by
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

Heatmaps are visualization tools, not analysis methods, so there is not a way to "rate" their "performance". Try a few and see what shows your data in the way that you think is most appropriate for your data. If you have something else in mind besides simply displaying data, could you clarify your experimental design and your biological questions?


Hi Sean! I just have a gene expression microarray matrix with about 300 samples and about 14,000 genes. I just want to see if a target gene (e.g. TP53) is co-expressed with another group of genes. My question is: is the Pearson correlation coefficient the best way to do this? And about the visualization: what is the best tool to visualize my result?

Reply written 8.0 years ago by elb190

Pearson is probably fine. Try a parallel-coordinate plot or heatmap to show the results of many genes at once. Use a scatterplot to show two genes at a time (TP53 vs Gene X).
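For the target-gene case, one possible sketch in plain Python is to correlate TP53 against every other gene, rank by r, and then look at the top hits. GENE_A, GENE_B, GENE_C and all the numbers below are hypothetical placeholders, not real measurements.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical normalised log2 expression: gene -> values across the same samples.
expression = {
    "TP53":   [8.1, 8.9, 9.4, 10.2, 10.8, 11.5],
    "GENE_A": [6.0, 6.7, 7.5, 8.1, 8.8, 9.6],
    "GENE_B": [9.9, 9.1, 8.7, 8.0, 7.2, 6.6],
    "GENE_C": [5.2, 7.9, 5.0, 7.7, 5.3, 8.0],
}
target = "TP53"

# Correlate the target with every other gene and sort from highest to lowest r.
ranked = sorted(
    ((gene, pearson(expression[target], values))
     for gene, values in expression.items() if gene != target),
    key=lambda item: item[1],
    reverse=True,
)
for gene, r in ranked:
    print(f"{gene}\t{r:+.3f}")

# The (x, y) points a TP53-vs-GENE_A scatterplot would show, one per sample.
scatter_points = list(zip(expression[target], expression["GENE_A"]))
```

The genes at the top of the ranking are the candidates to put in a heatmap or parallel-coordinate plot, and scatter_points is what you would hand to a plotting library for the two-gene view.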

Reply written 8.0 years ago by Sean Davis26k
8.0 years ago by
United States
seidel7.1k wrote:

Here's a simple approach, related to what Ben and Sean have said, but there are some things you'll need to clarify. With measurements on 300 samples, this likely represents data from several experiments, and you'll have to be explicit about what your data actually represent. Affymetrix technology measures transcript levels per sample. However, most experiments are designed to assess changes in transcript levels between conditions, so the changes are relative and absolute abundance is not known.

In terms of assessing co-expression, knowing whether your data represent absolute expression levels across 300 conditions or relative levels across 300 conditions is important, because the distance measures you would use to define similarity imply different things in each case. Which does your data table represent? Affymetrix abundance data is often turned into relative abundance data by creating ratios of experiment over control for each gene, which is very useful in general for thinking about the biology, but then absolute abundance information is lost.

Either way, whether you have 300 ratios of gene expression or 300 intensity measurements (i.e. abundance), you have a profile of gene expression. So the next thing to be clear about is how similarity of profiles is quantified, and what that implies about co-expression. If you have absolute expression data, and your definition of a co-expressed gene is the one with the closest abundance profile, then you would use Euclidean distance as the similarity measure (this is the default in R's heatmap function). However, there's no hard and fast rule that two genes which are co-expressed across a plethora of biological stimuli are each expressed at the same concentration within the cell, so perhaps similarity of profile (regardless of absolute abundance) is sufficient, in which case correlation would be a good measure.

By the way, if this doesn't make sense, look up and think about what each measure captures. Consider three genes with the following profiles:

g1: 125, 400, 800, 1200
g2: 125, 400, 800, 1200
g3: 425, 700, 1100, 1500

g1 and g2 have identical profiles, and by both Euclidean and correlation distance measures they are identical. However, when you assess their similarity to g3, Pearson correlation treats g1, g2, and g3 as identical (g3 is just g1 shifted up by 300), whereas by Euclidean distance g3 is different from g1 and g2. When it comes to ratios of expression, absolute abundance is out the window, but you can still assess similarity of profiles. From a biological perspective, similarity of profile is often considered co-expression, but you should think about how the measures above score similarity when the profiles are composed of gene expression ratios.
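The g1/g2/g3 comparison can be checked directly. The sketch below is plain Python; the function names euclidean and pearson_distance are just illustrative, and "correlation distance" is taken here to mean 1 - r, which is one common convention.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - r, so perfectly correlated profiles are at distance 0."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1 - sxy / math.sqrt(sxx * syy)

# The three profiles from the example above.
g1 = [125, 400, 800, 1200]
g2 = [125, 400, 800, 1200]
g3 = [425, 700, 1100, 1500]

# g1 vs g2: zero distance under both measures.
print(euclidean(g1, g2), pearson_distance(g1, g2))

# g1 vs g3: Euclidean distance is 600 (a shift of 300 in each of 4 samples),
# while the correlation distance is still (numerically) zero.
print(euclidean(g1, g3), pearson_distance(g1, g3))
```

So which genes count as "nearest" to g1 depends entirely on whether the measure pays attention to the offset between profiles or only to their shape.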

The answer to your question may depend on certain particulars of your data (so be clear about them). Define your data set, define co-expression, define your purpose. But in my experience, for most biological problems, I would say try a number of clustering methods, see how they differ, and see what they offer you in terms of organizing your data. I've found that when I create toy data sets with known combinations of profiles, there is no one method or solution that can pull them all out and re-organize them perfectly. Depending on your level of expertise, an easy package that allows you to experiment with many methods and visualize the results, while knowing hardly anything going in, is MeV (MultiExperiment Viewer).

7.9 years ago by
Lyon, France
Manu Prestat4.0k wrote:

Look at this paper, which compares three ways of building networks from your kind of data. "Simple" networks are called "relevance networks" (built from mutual information or correlation, which give very similar results): they are interesting, but they won't tell you much more than heatmap-based clustering. Instead, you should take a look at "graphical Gaussian models" or "Bayesian networks", which can distinguish between independence and conditional independence and so refine your network inference.

Werhli, A. V., Grzegorczyk, M., & Husmeier, D. (2006). Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical gaussian models and bayesian networks. Bioinformatics (Oxford, England), 22(20), 2523–2531. doi:10.1093/bioinformatics/btl391
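To make the independence versus conditional independence point concrete, here is a small Python sketch with three made-up profiles: a regulator and two targets that each follow it plus their own noise. It uses the standard first-order partial correlation formula, which for only three genes plays the role that the full partial correlation matrix plays in a graphical Gaussian model. All names and values are hypothetical, and this is not the method of the paper itself.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y given z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical centred profiles: one regulator and two mutually
# uncorrelated noise patterns (chosen to be orthogonal on purpose).
regulator = [1, 1, 1, 1, -1, -1, -1, -1]
noise_a = [1, 1, -1, -1, 1, 1, -1, -1]
noise_b = [1, -1, 1, -1, 1, -1, 1, -1]

# Both targets follow the regulator, each with its own noise.
target_a = [2 * r + n for r, n in zip(regulator, noise_a)]
target_b = [2 * r + n for r, n in zip(regulator, noise_b)]

# Marginal correlation is high (about 0.8): a relevance network
# would draw an edge between the two targets.
marginal = pearson(target_a, target_b)

# Given the regulator, the correlation vanishes (about 0): the link
# between the targets is explained entirely by their shared regulator.
conditional = partial_corr(target_a, target_b, regulator)

print(f"marginal r = {marginal:.3f}")
print(f"partial r given regulator = {abs(conditional):.3f}")
```

A correlation-only network would link target_a to target_b directly, whereas an approach based on conditional independence would keep only the two regulator-to-target edges.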

6.0 years ago by
mjoyraj80 wrote:

I have a similar query. My basic idea is to identify transcription factor binding sites (TFBS) upstream of the keratin gene. I want to do de novo discovery based on a search for over-represented sequences in the regulatory regions of the keratin gene. Therefore, my first idea is to search for genes co-expressed with my target gene, i.e. keratin. I have 15 RNA-seq gene expression data sets generated from developing feather cells. What strategy can I use to find the co-expressed genes?


Dr. M. Joyraj Bhattacharjee 


Your post is not an answer to the original question. Instead, you should ask a new question.

Reply written 6.0 years ago by Sean Davis26k