Question: How to calculate the correlation and p-values for all pairs of genes in a gene expression data set?
4.2 years ago, Mo wrote:


I have found many techniques and posts on calculating the correlation coefficients of a data matrix in R.

E.g. rcorr, cor, etc. Most of these suggestions were not useful when the data set is huge; in my case, for example, I have microarray data with 40,000 rows (genes) and 3,000 columns (samples).

Some example posts which I tried were as follows:

Gene-gene Pearson Correlation

Check For Co-Expressed Genes In Microarray Experiments

When the data is huge, these approaches either do not work (e.g. they give errors) or hang.


I would like to calculate the correlation and p-value for each pair of genes and then rank the pairs. Is there a useful approach? And how can I group similar genes?



Modified 4.2 years ago by merodev • written 4.2 years ago by Mo

Looks like there are too many questions in one:

1 - What should I use to test my gene x gene correlations?

2 - How do I adjust my p-values to get the significant ones (multiple hypothesis testing), or how do I rank them (arbitrary threshold)?

3 - What is the general approach for this kind of problem?

4 - How do I cluster similar genes?

What precisely is your goal here?

Reply written 4.2 years ago by toni

@toni Thanks for this comment. In fact, you are right: so many small questions at once!

Let's imagine I have a big matrix and I want to rank the genes based on their expression. I don't have any phenotype and I don't have any reference matrix; all I have is a matrix in which each row corresponds to a gene and each column corresponds to a sample.


Reply modified 4.2 years ago • written 4.2 years ago by Mo

Define "not working".

I imagine that you're running into memory issues since you need 1.6 billion floating point values and R isn't known for being terribly memory efficient.
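The arithmetic behind that figure can be sketched in a few lines. This is a rough estimate in Python rather than R, assuming 8-byte double-precision values (which is what R uses for numeric data) and ignoring any temporary copies made during the computation; the variable names are illustrative only.

```python
# Back-of-the-envelope memory estimate for a gene-by-gene correlation matrix.
n_genes = 40_000          # rows in the expression matrix (from the question)
bytes_per_double = 8      # assumption: values stored as 8-byte doubles

n_values = n_genes * n_genes                 # full square matrix
n_pairs = n_genes * (n_genes - 1) // 2       # unique gene pairs (one triangle)

full_gib = n_values * bytes_per_double / 2**30
pairs_gib = n_pairs * bytes_per_double / 2**30

print(f"{n_values:,} values -> {full_gib:.1f} GiB for the full matrix")
print(f"{n_pairs:,} unique pairs -> {pairs_gib:.1f} GiB for one triangle")
```

A matching matrix of p-values would need the same amount of memory again.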

Reply written 4.2 years ago by Devon Ryan

Yes, for sure. The definition of "not working" here = bloody freezing computer!

= Not being able to click or work with your computer, forever

= Not being able to know whether it is working or just looping around :-D

Reply written 4.2 years ago by Mo

It's likely swapping and thereby grinding the computer to a halt. Either use a computer with more memory (I wouldn't use anything with less than 16 gigs for this if you're using R) or implement this in C or another lower level language where you can control memory usage.
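One way to keep memory under control, whatever the language, is to avoid storing the full correlation matrix at all and keep only the strongest pairs as they are computed. The sketch below illustrates that idea in plain Python (not R or C); the function names and the dict-of-lists input layout are assumptions made for the example. It still needs the expression data itself in memory and still visits every pair, so it only saves space, not time.

```python
import heapq
import math
from itertools import combinations


def standardize(row):
    """Center a row and scale it to unit length, so the dot product of two
    standardized rows equals their Pearson correlation."""
    mean = sum(row) / len(row)
    centered = [x - mean for x in row]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        return None  # constant gene: correlation is undefined
    return [c / norm for c in centered]


def top_correlated_pairs(expr, k):
    """Return the k gene pairs with the largest |r| as (r, gene_a, gene_b).

    `expr` maps gene name -> list of expression values (hypothetical layout).
    Only k candidate pairs are held in the result heap at any time.
    """
    z = {g: standardize(v) for g, v in expr.items()}
    z = {g: v for g, v in z.items() if v is not None}
    heap = []  # min-heap keyed on |r|, so the weakest kept pair is heap[0]
    for a, b in combinations(sorted(z), 2):
        r = sum(x * y for x, y in zip(z[a], z[b]))
        item = (abs(r), r, a, b)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return [(r, a, b) for _, r, a, b in sorted(heap, reverse=True)]
```

For example, `top_correlated_pairs(expr, 1000)` would return the 1,000 most strongly correlated (or anti-correlated) pairs, already ranked by |r|.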

Reply written 4.2 years ago by Devon Ryan
4.2 years ago, Sean Davis (National Institutes of Health, Bethesda, MD) wrote:

Gene-by-gene correlation:  First, ask yourself if all genes are expressed?  Second, ask yourself if all genes are accurately measured?  Third, ask yourself if the gene expression measures for each gene carry any useful information (do they vary)?  The answer to each of these questions is inevitably, "no", so your list of 40,000 genes will quickly become something much smaller (say 15k or less, even).

Grouping similar genes:  This is typically done with clustering of some type.  However, not all clustering algorithms require O(n^2) computation time and memory like correlation.  Consider kmeans clustering or even self-organizing maps to group genes with similar expression patterns.  
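To make the cost argument concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is not a substitute for a tested implementation such as R's `kmeans`; the function name, the fixed iteration count, and the random initialization are simplifying assumptions. The point is that each iteration compares every gene with only k centers, so no gene-by-gene matrix is ever built.

```python
import random


def kmeans(rows, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length numeric rows.

    Returns one cluster label per row. Cost per iteration is
    O(n_rows * k * n_cols); memory is O(n_rows + k * n_cols) beyond the data.
    """
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]  # k distinct starting rows
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, row in enumerate(rows):
            labels[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if labels[i] == c]
            if members:  # leave an empty cluster's center where it was
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Standardizing each gene's row first (mean 0, unit variance) makes Euclidean distance behave like a correlation-based distance, which is usually closer to what "similar expression pattern" means.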

P-values:  Well, this one is tough for two reasons.  First, unsupervised methods of data analysis such as clustering and correlation do not lend themselves to hypothesis testing very well; they are better at hypothesis generation.  Second, when correcting for multiple testing of billions of tests, it may be difficult to find ANYTHING that is statistically significant.  Therefore, I would drop the p-value requirement and focus on the clustering exercise as a hypothesis-generating exercise and try to layer biological knowledge on the clusters that you generate to help (gene ontology, literature, GSEA, etc.).
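The scale of the multiple-testing problem can be illustrated with a short calculation. The sketch below, in Python, uses the Fisher z-transform as a large-sample normal approximation for the p-value of a Pearson correlation (not the t-based test that R's `cor.test` reports) and a plain Bonferroni correction; both choices, and the function names, are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def correlation_p_value(r, n_samples):
    """Two-sided p-value for H0: rho = 0 via the Fisher z-transform.

    Large-sample normal approximation; requires -1 < r < 1.
    """
    z = math.atanh(r) * math.sqrt(n_samples - 3)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and is accurate in the tail.
    return math.erfc(abs(z) / math.sqrt(2.0))


def min_significant_r(n_samples, n_tests, alpha=0.05):
    """Smallest |r| that survives a Bonferroni correction over n_tests."""
    z_crit = -NormalDist().inv_cdf(alpha / n_tests / 2.0)
    return math.tanh(z_crit / math.sqrt(n_samples - 3))


n_genes, n_samples = 40_000, 3_000           # sizes from the question
n_tests = n_genes * (n_genes - 1) // 2       # one test per unique gene pair
threshold = min_significant_r(n_samples, n_tests)
print(f"{n_tests:,} tests, Bonferroni |r| cutoff ~ {threshold:.3f}")
```

Under these assumptions the cutoff depends strongly on the number of samples: with few samples it approaches 1, while with thousands of samples even weak correlations can pass, so ranking pairs by the size of r may be more informative than the p-value alone.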


Written 4.2 years ago by Sean Davis

Thanks @Sean Davis for your comment! However, I don't agree with your first statements, where you keep saying "ask yourself"! :-D

Imagine you have over 40,000 genes: how would you ask yourself which ones were expressed or not, or accurately measured or not? I suppose anyone who runs an experiment aims to measure them accurately, and even if there is gross or systematic error, you cannot tell in advance. (Some people just throw out one gene, a few genes, or even half of the data because it does not fall in a cluster or behaves differently.) I don't want to just get a fit; I am trying to understand the data rather than to publish something!

How can you ask whether a gene carries enough information if you don't have a phenotype or any other dependent variable? So I wish I could agree with the comments above, but I cannot, since it is very vague to say "I don't like that gene" or "I keep this gene for further analysis and throw the rest away" (because they might be useless). I don't even do that to noise :-D :-D

For sure, the p-value is the tough guy!

Honestly, I could not find a differential expression technique that allows you to work with only a matrix; all of them need a phenotype or a reference matrix. I have already checked Limma, BitSeq, AffyExpress, dexus, bridge and many others!

If you know of any package that can differentiate genes based on a single matrix (in an unsupervised way), please don't hesitate to share! I will check it out.


Reply modified 4.2 years ago • written 4.2 years ago by Mo

Gene expression is always a "relative" measure on arrays and RNA-seq, so one needs to compare one group to another for hypothesis testing.  What you are trying to do is "unsupervised analysis", so you could do some searching in the literature for that topic to see what you can turn up.  That said, unsupervised analysis is often focused more on finding sets of samples that behave similarly.  Finally, note that unsupervised analysis is hypothesis-generating and is really not terribly useful for "proving" things.  

As for winnowing down your gene list, I'd suggest taking the top 25% of the most variable genes.  Those are variable, most likely expressed in some samples (since they are variable), carry information, and are often accurately measured (as opposed to being simply experimental noise).
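A variance filter of this kind needs no phenotype, only the matrix itself. As a minimal sketch in Python (the function name and the dict-of-lists layout are assumptions for the example, and population variance is an arbitrary choice since only the ranking matters):

```python
from statistics import pvariance


def most_variable_genes(expr, fraction=0.25):
    """Keep the given fraction of genes with the highest variance.

    `expr` maps gene name -> list of expression values (hypothetical layout).
    """
    variances = {gene: pvariance(values) for gene, values in expr.items()}
    n_keep = max(1, int(len(expr) * fraction))
    ranked = sorted(variances, key=variances.get, reverse=True)
    return {gene: expr[gene] for gene in ranked[:n_keep]}
```

Going from 40,000 genes to the top 10,000 cuts the number of gene pairs from about 800 million to about 50 million, a 16-fold reduction.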

Reply written 4.2 years ago by Sean Davis
4.2 years ago, merodev (United States) wrote:

Have you tried the LaF package to read your data? It works well with big data sets because it does not load all of your data into RAM. cor is then quite fast on expression data.


Written 4.2 years ago by merodev

Thanks for your comment, but this package did not help either!

Reply written 4.2 years ago by Mo