4.2 years ago by
National Institutes of Health, Bethesda, MD
Gene-by-gene correlation: First, ask yourself whether all genes are expressed. Second, ask whether all genes are accurately measured. Third, ask whether the expression measures for each gene carry any useful information (do they vary across samples?). The answer to each of these questions is inevitably "no", so your list of 40,000 genes will quickly shrink to something much smaller (say, 15k or fewer).
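As a rough illustration of the first and third questions, here is a minimal sketch of filtering genes on mean expression and on variance across samples. The function name `filter_genes`, the thresholds, and the toy values are all assumptions made for the example, not recommended defaults. The second question (measurement accuracy) depends on the platform and is not covered here.

```python
from statistics import mean, pvariance


def filter_genes(expr, min_mean=1.0, min_var=0.1):
    """Keep genes that look expressed and that vary across samples.

    expr maps a gene name to a list of expression values, one per sample.
    min_mean and min_var are illustrative thresholds (assumptions).
    """
    kept = {}
    for gene, values in expr.items():
        if mean(values) < min_mean:
            continue  # treated as not expressed
        if pvariance(values) < min_var:
            continue  # nearly constant, so it carries little information
        kept[gene] = values
    return kept


# Toy data (made up for illustration)
expr = {
    "geneA": [0.0, 0.1, 0.0, 0.2],  # essentially not expressed
    "geneB": [5.0, 5.0, 5.1, 5.0],  # expressed but flat
    "geneC": [2.0, 8.0, 3.0, 9.0],  # expressed and variable
}
filtered = filter_genes(expr)
print(sorted(filtered))  # ['geneC']
```

On a real dataset the thresholds would be chosen by looking at the distribution of means and variances rather than fixed in advance.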
Grouping similar genes: This is typically done with clustering of some type. Not all clustering algorithms require the O(n^2) computation time and memory that an all-pairs correlation matrix does. Consider k-means clustering or even self-organizing maps to group genes with similar expression patterns.
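To make the k-means suggestion concrete, here is a small pure-Python sketch of Lloyd's algorithm applied to gene expression profiles. Each iteration compares every gene against only k centers, so no n-by-n matrix is ever built. The function `kmeans`, the toy profiles, and the choice of k = 2 are assumptions for illustration; in practice a library implementation such as scikit-learn's `KMeans` would be the usual choice.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance
        new_labels = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            new_labels.append(dists.index(min(dists)))
        # Update step: move each center to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centers


# Toy profiles (made up): two rising genes and two falling genes
genes = ["g1", "g2", "g3", "g4"]
profiles = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [3.0, 2.0, 1.0],
    [2.9, 2.1, 1.2],
]
labels, centers = kmeans(profiles, k=2)
for gene, label in zip(genes, labels):
    print(gene, "-> cluster", label)
```

The cluster numbers themselves are arbitrary; what matters is which genes end up sharing a label.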
P-values: Well, this one is tough for two reasons. First, unsupervised methods of data analysis such as clustering and correlation do not lend themselves to hypothesis testing very well; they are better at hypothesis generation. Second, when correcting for multiple testing across billions of tests, it may be difficult to find ANYTHING that is statistically significant. Therefore, I would drop the p-value requirement, treat the clustering as a hypothesis-generating exercise, and layer biological knowledge onto the clusters that you generate to help interpret them (gene ontology, literature, GSEA, etc.).
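A quick back-of-the-envelope calculation shows the scale of the multiple-testing problem. The sketch below counts the unique gene pairs among 40,000 genes and applies a Bonferroni correction, which is used here only as one common, conservative example of a correction; the 0.05 significance level is also an assumption.

```python
from math import comb

n_genes = 40_000
n_pairs = comb(n_genes, 2)  # unique unordered gene pairs
alpha = 0.05                # assumed family-wise significance level
bonferroni_threshold = alpha / n_pairs

print(n_pairs)  # 799980000
print(bonferroni_threshold)
```

That is roughly 800 million unique pairs (about 1.6 billion if every ordered pair is counted), so each individual correlation would need a p-value below about 6e-11 to survive this correction.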