hypergeometric test for gene feature enrichment on population frequency data
1
2
Entering edit mode
4 months ago
code_evo ▴ 30

I have a data frame that represents the frequency of gene features in the background population, and several sub populations. For each sub-population, I want to do an enrichment analysis in R with correction for multiple comparisons, to identify significantly enriched gene features in each sub population (separately), relative to the background frequency for that gene feature. (i.e. are there gene features that are significantly over-represented in population 1 compared to the background? How about pop 2? etc.)

The intended result being a new data frame, with all gene features as the fist column, and three additional columns representing the enrichment p-values for each sub-population.

It seems like a hypergeometric \ fishers test would be appropriate here, but I'm unsure of how to implement it across this structure where frequencies are already computed and how to account for multiple comparisons.

In the example below each row represents a gene feature of interest (named in the "Gene_feature" col), where "Freq_background" represents the frequency of that gene feature across a whole population, and each data column (Freq_1, Freq_2, Freq_3...) represents the frequency of the gene feature in a sub-population of interest.

#generate example data
make_names <- function(n = 500) {
do.call(paste0, replicate(4, sample(LETTERS, n, TRUE), FALSE))
}
Gene_feature<- make_names(10)
Freq_background<- replicate(10, runif(n=1, min=1e-12, max=.9999999999))
Freq_pop1<- replicate(10, runif(n=1, min=1e-12, max=.9999999999))
Freq_pop2<- replicate(10, runif(n=1, min=1e-12, max=.9999999999))
Freq_pop3<- replicate(10, runif(n=1, min=1e-12, max=.9999999999))
df<- as.data.frame(cbind(Gene_feature, Freq_background, Freq_pop1, Freq_pop2, Freq_pop3))

| Gene_feature | Freq_background | Freq_1          | Freq_2          | Freq_3          |
| :----------- | :-------------: | :-------------: | :-------------: | :-------------: |
|PTPA          |0.613762008430914|0.791384656368987|0.978364152503174|0.466154247010611|
|FOKS          |0.517032054201702|0.873722912889272|0.866081856570345|0.385693898852212|
|AQPI          |0.708481218362451|0.216534708838529|0.901803070356394|0.976621562051737|


.. Intended output:

| Gene_feature | Freq_1_pval  | Freq_2_pval  | Freq_3_pval  |
| :----------- | :----------: | :----------: | :----------: |
|PTPA          |0.009         |0.001         |0.087         |
|FOKS          |0.599         |0.079         |0.999         |
|AQPI          |0.870         |0.998         |0.790         |


..

enrichment hypergeometric R phyper fisher • 284 views
0
Entering edit mode

check out SKAT-O. If I understand your question, SKAT-O is a pretty strong implementation for the type of test you describe.

essentially you'd be running the "burden" test on genomic elements instead of single variants... https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3415556/

0
Entering edit mode
4 months ago
biomon ▴ 60

I'm not sure what your question is. However, I think if you want to compare multiple groups it may be better to do a multivariate hypergeometric test instead of a hypergeometric test.