How do I calculate the p-value in a gene enrichment analysis, using parametric and non-parametric methods?
8.0 years ago
es ▴ 30

I have a list of 20 QTL from an experiment involving a certain phenotype related to bone development. Of all the candidate genes within the 20 QTL intervals, 5 are associated with a certain disease in humans. I tested the set of 20 QTL to see if they are enriched for the disease-associated genes. Each of the 20 QTL has a certain length in base pairs. I took that set of 20 lengths and randomly placed them in the mouse genome, forcing them not to overlap and to land no closer than 500,000 bases to one another. Then I took note of the genes inside each interval and counted how many matched the names in the disease-associated gene list. I repeated the process 10,000 times. This provides the expectation under randomness: how many disease-associated genes can I expect to see inside a set of 20 randomly placed QTL? I compare that to my actual observed number, then I calculate the probability.

To do that, I simply counted how many times exactly 1 disease-associated gene appeared within a set of 20 randomly placed QTL and divided by 10,000. That gives me the frequency. I did the same for 2 disease-associated genes, then 3, and so on up to 5, which is my observed value. I find that the frequency of 5s in the random data is less than 0.05. Is this appropriate?
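For what it's worth, here is a minimal sketch of the kind of simulation you describe. Everything in it is hypothetical (a single toy chromosome, evenly spaced genes, invented disease genes and QTL lengths); it just illustrates the placement-with-buffer and counting logic:

```python
import random

rng = random.Random(0)

# Toy stand-ins for the real inputs (all names and numbers are hypothetical):
# one chromosome, evenly spaced genes, and a set of disease-associated genes.
CHROM_LEN = 100_000_000
GENES = {f"gene{i}": (i * 50_000, i * 50_000 + 10_000) for i in range(2000)}
DISEASE_GENES = {"gene3", "gene70", "gene500", "gene999", "gene1500"}
QTL_LENGTHS = [rng.randint(500_000, 1_500_000) for _ in range(20)]
BUFFER = 500_000  # minimum gap required between placed intervals

def place_intervals(lengths, chrom_len, buffer):
    """Randomly place non-overlapping intervals, each at least `buffer` apart."""
    placed = []
    for length in lengths:
        for _ in range(1000):  # retry until a non-conflicting position is found
            start = rng.randrange(chrom_len - length)
            end = start + length
            if all(start > e + buffer or end < s - buffer for s, e in placed):
                placed.append((start, end))
                break
        else:
            raise RuntimeError("could not place interval without overlap")
    return placed

def count_disease_hits(intervals):
    """Count disease genes falling entirely inside any placed interval."""
    return sum(
        any(s <= g_start and g_end <= e for s, e in intervals)
        for g_start, g_end in (GENES[name] for name in DISEASE_GENES)
    )

# 1,000 rounds here to keep the sketch fast; the real analysis used 10,000.
null_counts = [count_disease_hits(place_intervals(QTL_LENGTHS, CHROM_LEN, BUFFER))
               for _ in range(1000)]
```

A real run would replace the toy genome with per-chromosome lengths and actual gene coordinates, but the structure is the same.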

Also, I see some online sources suggest a parametric method, like Fisher's exact test, but that is supposed to be used only for very small data sets, am I correct? Am I missing something there? How can I implement a parametric calculation of the p-value?

Gene-enrichment genomics simulation • 7.7k views
2
8.0 years ago

tl;dr: It sounds like you're on the right track.

When you expect that your background distribution might be strange or otherwise poorly described, a random sampling approach like the one you're using is the go-to method. There are a number of difficulties that one should keep in mind when performing the random sampling.

1. If you're not actually sampling randomly from the genome in your wet-bench experiment, then you also shouldn't do so in the simulation. In other words, try to make the sampling match what's being done at the wet-bench as closely as possible (so if you're using an array of some sort, use its probes in deciding where to randomly sample).
2. The signal strength or GC content or some other parameter is often related to your power to detect a change in a region. If you observe such a relationship, try to incorporate it into the simulation as well.
3. Depending on the lengths of your QTLs, you might consider sampling them from a distribution (Gaussian or otherwise). This would make more sense if they might form a rough distribution, of course.
4. I'm assuming that this gene list is one that you specifically looked for. If not, then you need to account for that as well (after all, your likelihood of finding a specific change is quite different from finding a change in any group).
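On point 3, if the 20 lengths look roughly normal, the distribution-based sampling can be as simple as fitting a mean and standard deviation and drawing new lengths from that fit. A minimal sketch with made-up lengths:

```python
import random
import statistics

# Hypothetical lengths (bp) for a handful of observed QTLs; in practice you
# would fit to your real 20 interval lengths.
observed_lengths = [800_000, 1_200_000, 950_000, 2_000_000, 1_500_000]

mu = statistics.mean(observed_lengths)
sigma = statistics.stdev(observed_lengths)

rng = random.Random(42)
# Draw simulated lengths from the fitted normal, floored at 100 kb so no
# interval comes out zero-length or negative.
simulated_lengths = [max(100_000, round(rng.gauss(mu, sigma))) for _ in range(20)]
```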

Regarding the actual p-value, it's just the number of times you observe an equal or greater count of the group in question in the simulation, divided by the number of iterations. Obviously, the more iterations the better, though 10,000 is probably a good number (I may have used 10k in a paper myself, I'd have to check).
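In code, that tail calculation is one line. The counts below are invented purely for illustration; the +1 in numerator and denominator is a common permutation-test convention that keeps the estimate away from an impossible p = 0:

```python
# `null_counts` would be the per-round counts from the 10,000 simulations;
# these ten values are made up purely for illustration.
null_counts = [0, 1, 2, 1, 0, 3, 5, 2, 1, 0]
observed = 5

# One-sided empirical p-value: the fraction of simulated rounds with a count
# at least as extreme as the observation.
p_value = (1 + sum(c >= observed for c in null_counts)) / (1 + len(null_counts))
```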

0

Even if you consider all of these situations, some confounder might still be missed, right? Does that mean that, over the past decades, a large number of false-positive enrichments have been reported in big bioinformatics and genomics studies?

1

Unknown confounders can always exist. There's no way around that in any experimental science.

0
8.0 years ago

Fisher's exact test and the hypergeometric test are the most common tests for functional enrichment.

The p-values depend upon the size of the differentially expressed gene list and the pre-defined gene sets. In general, larger differentially expressed gene lists (the equivalent of your QTL gene list) are better, but this strategy is already pretty sensitive: for example, you can get a p-value < 0.05 when only a single gene in your differentially expressed gene list is included in a pre-defined gene set. So I think this is probably sufficient, especially if you also mention your simulation results.
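To make the single-gene point concrete, here is a stdlib sketch of the hypergeometric upper tail (which matches a one-sided Fisher's exact p); all of the counts are hypothetical:

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k) where X counts the overlap when N genes are drawn from a
    universe of M genes containing n genes of interest."""
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1)) / comb(M, N)

# Hypothetical numbers: a universe of 20,000 annotated genes, 5 known
# disease genes, 10 candidate genes under the QTLs, 1 of them a disease gene.
p = hypergeom_sf(1, 20_000, 5, 10)  # comes out well below 0.05
```

With a large gene universe, even one overlapping gene from a small, specific set can be surprisingly unlikely by chance.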

There have been a lot of tools developed for gene set enrichment, so I'm sure you could find something if you looked hard enough. For example, ermineJ has some implementations for rank-based methods.

0

Thanks. My data are not expression data. Online GE tools are specifically designed for gene expression data, and the curated categories do not fit my needs, which is why I did my own test. I posted this question to ask how to go about finding a p-value. I think finding the expected frequencies for 1 gene, 2 genes, 3 genes... up to my observed number makes sense, but I wanted some validation. Do you agree that my comparison of 5 in 20 to the listing of frequencies from the simulation is appropriate? On Fisher's test, I keep reading that it is designed for small data sets only, yet I see it referred to often in gene enrichment studies. Can you shed some light on that? Thanks!

0

Fisher's exact test and the hypergeometric test have no small-sample restriction; their p-values simply depend on the sizes of the gene lists. I would say the opposite is true: you generally get better results with these tests with larger gene lists (for example, I would prefer to work with a list of >100 differentially expressed genes).

Either way, these are the most common tools used to address this type of question. I wouldn't be surprised if they yielded significant results, but that would depend upon the total number of known disease-associated genes. It is possible to get a significant result with a single overlapping gene, so it is certainly theoretically possible to get a significant result with 5 genes.

0

And I read that Fisher's exact test and the hypergeometric test are equivalent, so long as the FET is one-tailed. On the other hand, I also read that the hypergeometric test is based on sampling without replacement, whereas FET is not. I believe I might be reading about these tests in contexts that do not match up with my own. What would you say?

Also, can you give some advice on my use of frequencies in the random data, and comparing those to my observed values...is that a valid non-parametric p-value calculation?

0

Fisher's exact test can be either one- or two-tailed and is effectively synonymous with a hypergeometric test (in practice, at least, though this needn't always be the case). Regarding Fisher's test being for small datasets, that's mostly in comparison to a Chi-squared test, and because computing Fisher's test by hand is a non-starter when you have more than small double-digit numbers (by hand, you'd need to calculate a bunch of factorials, which quickly become obscenely large). With a computer and numerical methods, Fisher's test is certainly applicable to large datasets.
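The factorial point is easy to see with plain integer arithmetic: Python's `math.comb` computes binomial coefficients exactly, so the hypergeometric form of the one-tailed test works at any scale. A sketch (the large table's counts are invented):

```python
from math import comb

def fisher_one_tailed(a, b, c, d):
    """One-tailed (enrichment) Fisher's exact p for the 2x2 table
    [[a, b], [c, d]]: the hypergeometric probability of seeing an
    overlap of a or more."""
    M, n, N = a + b + c + d, a + b, a + c
    return sum(comb(n, k) * comb(M - n, N - k)
               for k in range(a, min(n, N) + 1)) / comb(M, N)

# The classic lady-tasting-tea table gives the textbook value 17/70:
p_small = fisher_one_tailed(3, 1, 1, 3)

# Large counts are no obstacle once exact integer arithmetic handles the
# factorials for you:
p_large = fisher_one_tailed(50, 450, 950, 18_550)
```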

0

I used an online Fisher's exact test calculator at http://vassarstats.net/index.html, and it said I was using numbers that were too big. This seemed to agree with what I had read elsewhere, to boot. I will try a different tool, perhaps R.

0

Yes - try R. That should work.