How to calculate correlations and p-values for all pairs of genes in an expression dataset?
6.2 years ago
Mo ▴ 920

Hello,

I have found many techniques and posts for calculating the correlation coefficients of a given data matrix in R,

e.g. `rcorr`, `cor`, etc. Most of these were not useful when the dataset is huge; in my case I have microarray data with 40,000 rows (genes) and 3,000 columns (samples).

Some example posts which I tried were as follows:

Gene-gene Pearson Correlation

Check For Co-Expressed Genes In Microarray Experiments

When the data is this large, these approaches either fail outright (e.g. throw errors) or simply hang.

I would like to calculate the correlation and p-value for each pair of genes and then rank the pairs. Is there a practical approach? And how can I group similar genes?
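For reference, one common way to get both numbers at once, working block-by-block so the full 40,000 × 40,000 matrix never has to sit in memory (a minimal sketch; `expr`, `cor_with_p`, and the block indices are illustrative names, not from the linked posts): for Pearson's r over n samples, t = r·sqrt(n−2)/sqrt(1−r²) follows a t-distribution with n−2 degrees of freedom under the null, so p-values come almost for free once r is computed.

```r
## Sketch: Pearson r and two-sided p-values for a block of genes against
## all genes, via the t-distribution of r. Process blocks of rows so the
## full gene-by-gene matrix never has to be held in memory at once.
cor_with_p <- function(expr, block) {
  n <- ncol(expr)                                    # samples per gene
  r <- cor(t(expr[block, , drop = FALSE]), t(expr))  # |block| x n_genes
  t_stat <- r * sqrt(n - 2) / sqrt(pmax(1 - r^2, .Machine$double.eps))
  p <- 2 * pt(-abs(t_stat), df = n - 2)
  list(r = r, p = p)
}
```

Each call returns a |block| × n_genes slice of r and p, so you can keep only the top-ranked pairs from each slice before moving to the next block.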

 

 

R Microarray Correlation Coefficient p values

It looks like there are several questions bundled into one:

1 - What should I use to test my gene-by-gene correlations?

2 - How should I adjust my p-values to find the significant ones (multiple hypothesis testing), or how should I rank them (arbitrary threshold)?

3 - What is the general approach for this kind of problem?

4 - How do I cluster similar genes?

What, precisely, is your goal here?


@toni Thanks for this comment. In fact, you are right: there are many small questions at once!

Let's imagine I have a big matrix and I want to rank the genes based on their expression. I don't have any phenotype and I don't have any reference matrix; what I have is a matrix in which each row corresponds to a gene and each column corresponds to a sample.

 


Define "not working".

I imagine you're running into memory issues, since you need 1.6 billion floating-point values and R isn't known for being terribly memory-efficient.
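The back-of-the-envelope arithmetic behind that: a dense 40,000 × 40,000 matrix of 8-byte doubles.

```r
## Size of ONE dense gene-by-gene correlation matrix of doubles:
n_genes <- 40000
n_genes^2 * 8 / 1024^3   # ~11.9 GiB, before R makes any temporary copies
```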


Yes, for sure. The definition of "not working" here = a completely frozen computer!

= Not being able to click or do anything else on the machine, seemingly forever.

= Not being able to tell whether it is actually working or just looping around :-D


It's likely swapping and thereby grinding the computer to a halt. Either use a computer with more memory (I wouldn't use anything with less than 16 gigs for this if you're using R) or implement this in C or another lower level language where you can control memory usage.

6.2 years ago

Gene-by-gene correlation:  First, ask yourself if all genes are expressed?  Second, ask yourself if all genes are accurately measured?  Third, ask yourself if the gene expression measures for each gene carry any useful information (do they vary)?  The answer to each of these questions is inevitably, "no", so your list of 40,000 genes will quickly become something much smaller (say 15k or less, even).

Grouping similar genes:  This is typically done with clustering of some type.  However, not all clustering algorithms require O(n^2) computation time and memory like correlation.  Consider kmeans clustering or even self-organizing maps to group genes with similar expression patterns.  
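A minimal k-means sketch along those lines (`expr` as a genes × samples matrix and `centers = 50` are illustrative choices, not recommendations from this answer; rows are z-scored first so clusters reflect expression pattern rather than absolute level):

```r
## `expr` stands in for your real genes x samples matrix.
set.seed(1)
expr <- matrix(rnorm(2000 * 30), nrow = 2000)

## Group genes by expression pattern without building an O(n^2)
## correlation matrix; k-means is roughly O(n * k) per iteration.
expr_scaled <- t(scale(t(expr)))   # z-score each gene (row)
km <- kmeans(expr_scaled, centers = 50, iter.max = 50, nstart = 5)
head(sort(table(km$cluster), decreasing = TRUE))   # largest clusters
```

One caveat: constant (zero-variance) genes become NaN under scaling and will make `kmeans` fail, which is another reason to filter the gene list first as suggested above.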

P-values:  Well, this one is tough for two reasons.  First, unsupervised methods of data analysis such as clustering and correlation do not lend themselves to hypothesis testing very well; they are better at hypothesis generation.  Second, when correcting for multiple testing of billions of tests, it may be difficult to find ANYTHING that is statistically significant.  Therefore, I would drop the p-value requirement and focus on the clustering exercise as a hypothesis-generating exercise and try to layer biological knowledge on the clusters that you generate to help (gene ontology, literature, GSEA, etc.).
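If you do keep p-values, at least correct them before ranking; a sketch with Benjamini-Hochberg FDR control (the p-value vector here is simulated purely to show the call):

```r
## Stand-in for real pairwise correlation p-values.
set.seed(1)
p_values <- runif(1e4)

## FDR-adjust and count pairs surviving a 5% false discovery rate.
p_adj <- p.adjust(p_values, method = "BH")   # Benjamini-Hochberg
sum(p_adj < 0.05)
```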

 


Thanks @Sean Davis for your comment! However, I don't agree with your opening points, where you keep saying "ask yourself"! :-D

Imagine you have over 40,000 genes: how would you ask yourself which ones were expressed, or which were accurately measured? I suppose whoever runs an experiment aims to measure them accurately; even if there is gross or systematic error, you cannot tell in advance. (Some people throw out one gene, a few, or even half of the data just because it is not in a cluster or behaves differently.) I don't want to just get a fit; I am trying to understand the data rather than publish something!

And how can you ask whether a gene carries enough information if you don't have a phenotype or any other dependent variable? So I wish I could agree with the points above, but I can't, because it is very vague to say "I don't like that gene" or "I'll keep this gene for further analysis but throw the rest away" (on the grounds that they might be useless). I wouldn't even do that to noise! :-D :-D

For sure, the p-value is the tough one!

Honestly, I could not find a differential expression technique that lets you work with only a matrix; all of them need a phenotype or a reference matrix. I have already checked limma, BitSeq, AffyExpress, dexus, bridge and many others!

If you know of any package that can differentiate genes based on a single matrix (in an unsupervised way), please don't hesitate to share! I will check it out.

 


Gene expression is always a "relative" measure on arrays and RNA-seq, so one needs to compare one group to another for hypothesis testing.  What you are trying to do is "unsupervised analysis", so you could do some searching in the literature for that topic to see what you can turn up.  That said, unsupervised analysis is often focused more on finding sets of samples that behave similarly.  Finally, note that unsupervised analysis is hypothesis-generating and is really not terribly useful for "proving" things.  

As for winnowing down your gene list, I'd suggest taking the top 25% of the most variable genes.  Those are variable, most likely expressed in some samples (since they are variable), carry information, and are often accurately measured (as opposed to being simply experimental noise).
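That filter is a couple of lines in base R (`expr` stands in for the real genes × samples matrix; `matrixStats::rowVars` would be faster on 40,000 rows, but `apply` keeps this dependency-free):

```r
## `expr` stands in for your real genes x samples matrix.
set.seed(1)
expr <- matrix(rnorm(400 * 10), nrow = 400)

## Keep the top 25% most variable genes.
v        <- apply(expr, 1, var)              # per-gene variance
keep     <- v >= quantile(v, 0.75)
expr_top <- expr[keep, , drop = FALSE]
nrow(expr_top)                               # about a quarter of the genes
```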

6.2 years ago
merodev ▴ 140

Have you tried the LaF package to read your data? It works well with big datasets, since it does not load the data into RAM. cor is then quite fast on expression data.

 


Thanks for your comment, but this package did not help either!
