Question: How To Test Whether Copy Number Aberrations Are Enriched In A Gene List
8.7 years ago by
user1202664200
user1202664200 wrote:

I begin with a matrix of genes (~30,000 rows) by cases (columns). Each cell holds a binary value, TRUE or FALSE, indicating whether a copy number aberration spans that gene in that case. I sum each row to get the number of cases in the sample set that have a copy number aberration in each gene; I will call this c.

Next, I retrieve a list of genes from http://cbio.mskcc.org/CancerGenes/Select.action (say the Entrez Query: Stability list, which yields 1023 genes). How can I test whether c among the genes in the Entrez Query: Stability list is significantly greater than c among genes in general? Furthermore, is there some way to attach a measure of uncertainty to my result, given that not all genes that actually underpin stability are in the Entrez Query: Stability list? I am using R.
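To make the setup concrete, here is a minimal sketch in R using simulated data. The matrix dimensions, the 10% aberration rate, and the names `cna`, `cna_count`, `stability_genes` and `in_list` are all assumptions for illustration; real data would replace the simulated matrix and the randomly chosen gene list.

```r
set.seed(1)
n_genes <- 3000   # the post has ~30000 genes; smaller here to keep the example quick
n_cases <- 50     # hypothetical number of cases

# Simulated TRUE/FALSE matrix: rows are genes, columns are cases
cna <- matrix(runif(n_genes * n_cases) < 0.1,
              nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              paste0("case", seq_len(n_cases))))

# c for each gene: number of cases with an aberration spanning that gene
cna_count <- rowSums(cna)

# Hypothetical gene list standing in for the downloaded Stability list
stability_genes <- sample(rownames(cna), 100)
in_list <- rownames(cna) %in% stability_genes

# The two groups to be compared
summary(cna_count[in_list])
summary(cna_count[!in_list])
```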

NB. My attempt (probably wrong and overly complicated):

1. Bootstrap with replacement 100000 times to get 100000 samples each of size 1023 from the 30000 genes. The 1023 Stability genes are excluded from the population from which the genes for each sample are drawn.
2. Use `var.test` across c for each sample vs the 1023 Stability gene list. If the test does not produce a significant value, that sample can be used in step 3. Samples that produce a significant value are discarded. This is done to satisfy the assumption of the Wilcoxon test that the variances between the two test samples are similar.
3. Do a Kolmogorov–Smirnov test (`ks.test`) for each sample brought forward vs the 1023 Stability gene list. If the test does not produce a significant value, that sample can be used in step 4. Samples that produce a significant value are discarded. This is done to satisfy the assumption of the Wilcoxon test that the distributions between the two test samples are similar.
4. Do a Wilcoxon test (`wilcox.test`) for each sample brought forward vs the 1023 Stability gene list. If the mean (or median?) p-value for these tests is significant, I can say that c is significantly greater for Stability genes than it is for non-Stability genes.
statistics • 2.5k views
modified 8.7 years ago by Qdjm1.9k • written 8.7 years ago by user1202664200
8.7 years ago by
Qdjm1.9k
Toronto
Qdjm1.9k wrote:

A simple approach would be to use a Wilcoxon rank-sum (i.e. Mann-Whitney U) test. In R this is the two-sample form of `wilcox.test`.
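A minimal sketch of that test, assuming simulated counts in which the listed genes are aberrant slightly more often (the rates 0.15 and 0.10, the sizes, and the variable names are made up for illustration):

```r
set.seed(42)
n_genes <- 3000
n_cases <- 50

# Hypothetical membership in the gene list
in_list <- seq_len(n_genes) %in% sample(n_genes, 100)

# Simulated c: number of cases with an aberration, per gene
p <- ifelse(in_list, 0.15, 0.10)
cna_count <- rbinom(n_genes, size = n_cases, prob = p)

# One-sided rank-sum test: is c shifted upward in the list genes?
res <- wilcox.test(cna_count[in_list], cna_count[!in_list],
                   alternative = "greater")
res$p.value
```

With `alternative = "greater"` the test asks whether c tends to be larger in the first group (the list genes) than in the second, which matches the direction of the question.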

It seems like it would be difficult to account for unannotated stability genes, but if there is a real effect, the unannotated genes sitting in the background set will simply dilute it, so you need not worry about an inflated Type I error rate.

What I would worry about, if I were you, is whether there is any relationship between copy number aberrations and Stability genes that is independent of the presumed functional connection in your study. For example, maybe longer genes are more likely to contain copy number aberrations AND are more likely to be Stability genes, but the two are otherwise unrelated to one another. You need to think of appropriate negative controls (whether experimental or statistical) to control for this.
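One possible statistical control, sketched here as an assumption rather than a prescribed method, is a permutation test that shuffles the list labels only within gene-length strata, so that the null sets have the same length profile as the real list. The data are simulated with length confounding built in, and `gene_len`, `stratum` and `permute_within` are hypothetical names.

```r
set.seed(7)
n_genes <- 2000
n_cases <- 50

# Simulated gene lengths in bp
gene_len <- round(10^rnorm(n_genes, mean = 4.3, sd = 0.5))

# Simulated confounding: longer genes are more often aberrant
# AND more often in the list, with no other link between the two
cna_count <- rbinom(n_genes, n_cases,
                    plogis(-3 + 0.8 * (log10(gene_len) - 4.3)))
in_list <- runif(n_genes) < plogis(-3 + 1.5 * (log10(gene_len) - 4.3))

# Length strata: deciles of gene length
stratum <- cut(rank(gene_len, ties.method = "first"),
               breaks = 10, labels = FALSE)

# Test statistic: difference in mean c, list genes minus the rest
stat <- function(lab) mean(cna_count[lab]) - mean(cna_count[!lab])
observed <- stat(in_list)

# Shuffle the labels within each length stratum only
permute_within <- function(lab, strata) {
  ave(as.numeric(lab), strata,
      FUN = function(x) x[sample.int(length(x))]) == 1
}
null_stats <- replicate(2000, stat(permute_within(in_list, stratum)))

# One-sided empirical p-value (with the usual +1 correction)
p_strat <- (1 + sum(null_stats >= observed)) / (1 + length(null_stats))
p_strat
```

Because each permuted label set keeps the same number of list genes per length decile, any difference that is explained by length alone should also appear in the null distribution.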

+1 about the confounding effect of length. Do you think this could be mitigated somewhat by requiring the cumulative length (in bp) of the genes in the 1023 Stability gene list to equal the cumulative length of the genes in each sampled gene list, give or take 1%?
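The cumulative-length idea could be sketched as simple rejection sampling: draw random gene sets and keep only those whose total length falls within 1% of the list's total. The data are simulated, and `draw_matched`, `max_tries` and `tol` are hypothetical names and defaults.

```r
set.seed(11)
n_genes <- 5000
gene_len <- round(10^rnorm(n_genes, mean = 4.3, sd = 0.4))  # simulated lengths (bp)
in_list <- seq_len(n_genes) %in% sample(n_genes, 200)       # hypothetical gene list

target <- sum(gene_len[in_list])   # cumulative length of the list genes
pool   <- which(!in_list)          # list genes are excluded from the background
k      <- sum(in_list)

# Draw random sets of k background genes; return the first set whose
# cumulative length is within `tol` (1%) of the list's cumulative length
draw_matched <- function(max_tries = 10000, tol = 0.01) {
  for (i in seq_len(max_tries)) {
    idx <- sample(pool, k)
    if (abs(sum(gene_len[idx]) - target) <= tol * target) return(idx)
  }
  stop("no matched set found; loosen the tolerance or match on length bins")
}

matched <- draw_matched()
sum(gene_len[matched]) / target   # ratio of cumulative lengths
```

Note that matching the cumulative length does not guarantee that the two sets have the same distribution of individual gene lengths: one very long gene can offset many short ones.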

That's comforting. You could try a rank-sum or KS test to check whether there are significant differences between the two gene-length distributions.
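As a rough sketch of that check, assuming it is applied to the gene lengths of the two sets (the lengths below are simulated and the names `len_list` and `len_sample` are hypothetical):

```r
set.seed(3)
len_list   <- 10^rnorm(200, mean = 4.5, sd = 0.4)  # simulated lengths, list genes
len_sample <- 10^rnorm(200, mean = 4.3, sd = 0.4)  # simulated lengths, sampled genes

rs <- wilcox.test(len_list, len_sample)  # sensitive to a shift in location
ks <- ks.test(len_list, len_sample)      # sensitive to any difference in distribution

c(ranksum = rs$p.value, ks = ks$p.value)
```

A small p-value from either test would suggest that the sampled set is not a good length-matched control for the list.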