Question

How to calculate correlations and p-values for all pairwise combinations of genes in an expression dataset?

4

10.4 years ago

Mo ▴ 920

Hello,

I have found many techniques and posts on calculating correlation coefficients for a matrix in R, e.g. rcorr, cor, etc.

Most of them were not useful when the dataset is huge. In my case, I have microarray data with 40,000 rows (genes) and 3,000 columns (samples).

Some example posts which I tried were as follows:

Gene-gene Pearson Correlation

Check For Co-Expressed Genes In Microarray Experiments

When the data is huge, these approaches either do not work (e.g. they give errors) or block, and so on.

I would like to calculate the correlation and p-value for each pair of genes and then rank them. Is there a useful approach? How can I group similar genes?

Microarray R p-values Correlation-Coefficient • 7.0k views

updated 2.6 years ago by Ram 45k • written 10.4 years ago by Mo ▴ 920

0

Looks like there are too many questions in one:

What should I use to test my geneXgene correlations?
How to adjust my p-values to get the significant ones (Multiple Hypothesis Testing) or how to rank (arbitrary threshold)?
What is the general approach for that kind of problem?
How to cluster similar genes?

What is precisely your goal here?

updated 3.2 years ago by Ram 45k • written 10.4 years ago by toni ★ 2.2k

1

@toni Thanks for this comment. You are right, that is a lot of small questions at once!

Let's imagine I have a big matrix and I want to rank the genes based on their expression. I don't have any phenotype or reference matrix; all I have is a matrix in which each row corresponds to a gene and each column corresponds to a sample.

updated 3.2 years ago by Ram 45k • written 10.4 years ago by Mo ▴ 920

0

Define "not working".

I imagine that you're running into memory issues since you need 1.6 billion floating point values and R isn't known for being terribly memory efficient.
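To make that estimate concrete, here is a small sketch of the arithmetic (written in Python rather than R, purely for illustration). It assumes each correlation is stored as an 8-byte double, which is R's default numeric type, and uses the 40,000-gene count from the question.

```python
# Back-of-the-envelope memory estimate for a full gene-by-gene correlation matrix.
n_genes = 40_000          # rows in the expression matrix (from the question)
bytes_per_double = 8      # assumption: one double-precision float per correlation

# Every gene against every gene.
n_values = n_genes * n_genes
full_gib = n_values * bytes_per_double / 2**30

# Only the upper triangle (excluding the diagonal) holds distinct pairs.
n_pairs = n_genes * (n_genes - 1) // 2
triangle_gib = n_pairs * bytes_per_double / 2**30

print(f"{n_values:,} values, about {full_gib:.1f} GiB as doubles")
print(f"{n_pairs:,} unique pairs, about {triangle_gib:.1f} GiB as doubles")
```

A matching matrix of p-values would need the same amount of memory again.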

written 10.4 years ago by Devon Ryan 105k

0

Yes, for sure. Definition of "not working" here = bloody freezing computer!

= Not being able to click or work with your computer forever

= Not being able to know whether it is working or just looping around :-D

updated 3.2 years ago by Ram 45k • written 10.4 years ago by Mo ▴ 920

0

It's likely swapping and thereby grinding the computer to a halt. Either use a computer with more memory (I wouldn't use anything with less than 16 gigs for this if you're using R) or implement this in C or another lower level language where you can control memory usage.
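One way to keep memory under control without leaving a high-level language is to compute the correlations in blocks and keep only the pairs of interest, so the full matrix never exists in memory at once. The following is a minimal sketch in Python with NumPy, not a tuned implementation. The function names, the `threshold` of 0.9, and the `block` size are all illustrative assumptions.

```python
import numpy as np

def standardize_rows(x):
    """Center each row and scale it to unit length, so dot products are Pearson r."""
    x = x - x.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # constant rows get r = 0 rather than NaN
    return x / norms

def strong_pairs(expr, threshold=0.9, block=1000):
    """Yield (i, j, r) for gene pairs i < j with |r| >= threshold, one block at a time."""
    z = standardize_rows(np.asarray(expr, dtype=float))
    n = z.shape[0]
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Correlations of this block of genes against all genes from `start` onward.
        r = z[start:stop] @ z[start:].T
        rows, cols = np.nonzero(np.abs(r) >= threshold)
        for a, b in zip(rows, cols):
            i, j = start + a, start + b
            if i < j:   # skip the diagonal and mirrored duplicates
                yield int(i), int(j), float(r[a, b])

# Small synthetic demo: 50 "genes" x 30 "samples", with gene 1 tracking gene 0.
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 30))
demo[1] = demo[0] * 2 + 0.01 * rng.normal(size=30)
pairs = list(strong_pairs(demo, threshold=0.9, block=16))
print(pairs)
```

With 40,000 genes and a block of 1,000, the largest temporary block holds 1,000 x 40,000 doubles (about 320 MB), on top of the standardized input matrix itself.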

written 10.4 years ago by Devon Ryan 105k

Ram · Answer 1 · 2015-02-11

2

10.4 years ago

Sean Davis 27k

Gene-by-gene correlation: First, ask yourself if all genes are expressed? Second, ask yourself if all genes are accurately measured? Third, ask yourself if the gene expression measures for each gene carry any useful information (do they vary)? The answer to each of these questions is inevitably, "no", so your list of 40,000 genes will quickly become something much smaller (say 15k or less, even).

Grouping similar genes: This is typically done with clustering of some type. However, not all clustering algorithms require O(n^2) computation time and memory like correlation. Consider kmeans clustering or even self-organizing maps to group genes with similar expression patterns.
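As an illustration of the k-means idea, here is a minimal sketch of plain Lloyd's algorithm in Python with NumPy. It is not a recommendation of a specific package. The function name `kmeans`, the choice of `k`, and the synthetic data are assumptions made for the example. Each iteration needs only an n-by-k distance table rather than an n-by-n matrix.

```python
import numpy as np

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on the rows of x; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Start from k distinct rows chosen at random (fancy indexing returns a copy).
    centers = x[rng.choice(len(x), size=k, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    x_sq = (x ** 2).sum(axis=1)[:, None]
    for it in range(n_iter):
        # Squared distances via ||x||^2 - 2 x.c + ||c||^2, shape (n_rows, k).
        d2 = x_sq - 2.0 * (x @ centers.T) + (centers ** 2).sum(axis=1)[None, :]
        new_labels = d2.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break               # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = x[labels == c]
            if len(members):    # leave an empty cluster's center unchanged
                centers[c] = members.mean(axis=0)
    return labels, centers

# Synthetic demo: 20 "genes" with a rising profile and 20 with a falling one.
rng = np.random.default_rng(1)
pattern_a = np.linspace(-1.0, 1.0, 12)
pattern_b = -pattern_a
demo = np.vstack([
    pattern_a + 0.05 * rng.normal(size=(20, 12)),
    pattern_b + 0.05 * rng.normal(size=(20, 12)),
])
labels, centers = kmeans(demo, k=2)
print(labels)
```

If each gene's profile is first centered and scaled to unit length, the squared Euclidean distance between two genes equals 2(1 - r), so k-means on such rows groups genes by correlation without ever forming the pairwise matrix.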

P-values: Well, this one is tough for two reasons. First, unsupervised methods of data analysis such as clustering and correlation do not lend themselves to hypothesis testing very well; they are better at hypothesis generation. Second, when correcting for multiple testing of billions of tests, it may be difficult to find ANYTHING that is statistically significant. Therefore, I would drop the p-value requirement and focus on the clustering exercise as a hypothesis-generating exercise and try to layer biological knowledge on the clusters that you generate to help (gene ontology, literature, GSEA, etc.).
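To get a feel for what multiple-testing correction means at this scale, the sketch below computes a Bonferroni threshold for all unique gene pairs and the |r| that would be needed to pass it with the sample count from the question. It is written in Python using only the standard library. The p-value uses the Fisher z-transform normal approximation, which is an assumption of this sketch and not the method of any particular R function, and the 0.05 family-wise level is likewise illustrative.

```python
import math
from statistics import NormalDist

def pearson_p_value(r, n_samples):
    """Approximate two-sided p-value for Pearson r (requires -1 < r < 1)."""
    # Fisher z-transform: atanh(r) is roughly normal with SD 1 / sqrt(n - 3).
    z = math.atanh(r) * math.sqrt(n_samples - 3)
    # erfc keeps precision for very small tail probabilities.
    return math.erfc(abs(z) / math.sqrt(2.0))

def critical_r(alpha, n_samples):
    """Smallest |r| whose approximate two-sided p-value is at most alpha."""
    z_crit = -NormalDist().inv_cdf(alpha / 2.0)
    return math.tanh(z_crit / math.sqrt(n_samples - 3))

n_genes, n_samples = 40_000, 3_000          # dimensions from the question
n_tests = n_genes * (n_genes - 1) // 2      # one test per unique gene pair
bonferroni_alpha = 0.05 / n_tests           # assumed family-wise level of 0.05
r_needed = critical_r(bonferroni_alpha, n_samples)

print(f"tests: {n_tests:,}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha:.3g}")
print(f"|r| needed at n = {n_samples}: {r_needed:.3f}")
```

The required |r| depends strongly on the number of samples, so rerunning this with a smaller `n_samples` shows how quickly the bar rises for smaller studies.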

updated 3.2 years ago by Ram 45k • written 10.4 years ago by Sean Davis 27k

1

Thanks @Sean Davis for your comment! However, I don't agree with your first statements, where you keep saying "ask yourself"! :-D

Imagine you have over 40,000 genes. How would you ask yourself which ones were expressed, or which ones were accurately measured? I suppose anyone who runs an experiment aims to measure them accurately, and even if there is gross or systematic error, you cannot tell in advance. Some people just throw out one gene, a few genes, or even half of the data because it does not fall in a cluster or behaves differently. I don't want to just get a fit; I am trying to understand the data rather than to publish something!

How can you ask whether a gene carries enough information if you don't have a phenotype or any other dependent variable? So I wish I could agree with those comments, but I can't: it is very vague to say "I don't like that gene" or "I keep this gene for further analysis and throw the rest away" (because they might be useless). I don't even do that to noise :-D :-D

For sure, the p-value is the tough guy!

Honestly, I could not find a differential expression technique that allows you to work with only a matrix. All of them need a phenotype or a reference matrix. I have already checked Limma, BitSeq, AffyExpress, dexus, bridge and many others!

If you know of any package that can differentiate genes based on a single matrix (in an unsupervised way), please don't hesitate to share! I will check it out.

updated 3.2 years ago by Ram 45k • written 10.4 years ago by Mo ▴ 920

0

Gene expression is always a "relative" measure on arrays and RNA-seq, so one needs to compare one group to another for hypothesis testing. What you are trying to do is "unsupervised analysis", so you could do some searching in the literature for that topic to see what you can turn up. That said, unsupervised analysis is often focused more on finding sets of samples that behave similarly. Finally, note that unsupervised analysis is hypothesis-generating and is really not terribly useful for "proving" things.

As for winnowing down your gene list, I'd suggest taking the top 25% of the most variable genes. Those are variable, most likely expressed in some samples (since they are variable), carry information, and are often accurately measured (as opposed to being simply experimental noise).
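The variance filter can be sketched in a few lines. This is an illustrative Python/NumPy version, not code from any package. The function name `most_variable_rows` and the toy matrix are assumptions, and it expects genes in rows and samples in columns, as in the question.

```python
import numpy as np

def most_variable_rows(expr, fraction=0.25):
    """Return sorted row indices of the `fraction` of rows with the highest variance."""
    expr = np.asarray(expr, dtype=float)
    variances = expr.var(axis=1, ddof=1)          # sample variance per gene
    n_keep = max(1, int(round(fraction * expr.shape[0])))
    top = np.argsort(variances)[::-1][:n_keep]    # indices of the largest variances
    return np.sort(top)

# Toy matrix: 8 "genes" x 5 "samples"; genes 3 and 6 vary far more than the rest.
scales = np.array([1.0, 1.0, 1.0, 10.0, 1.0, 1.0, 20.0, 1.0])
base = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
demo = np.outer(scales, base)

keep = most_variable_rows(demo, fraction=0.25)
filtered = demo[keep]
print(keep, filtered.shape)
```

Applied to 40,000 genes, a 25% cut leaves 10,000 genes, which shrinks the pairwise correlation matrix by a factor of 16.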

updated 3.2 years ago by Ram 45k • written 10.4 years ago by Sean Davis 27k

Ram · Answer 2 · 2015-02-11

0

10.4 years ago

merodev ▴ 150

Have you tried the LaF package to read your data? It works well with large datasets because it does not load all of the data into RAM. cor is then quite fast on expression data.

updated 3.2 years ago by Ram 45k • written 10.4 years ago by merodev ▴ 150

0

Thanks for your comment, but this package did not help either!

updated 3.2 years ago by Ram 45k • written 10.4 years ago by Mo ▴ 920