I have a sequencing panel of approx. 1100 genes for 100 or so patients. We found that approx. 150 of these genes exhibit mutational frequencies greater than 5% in our population, and we want to get an idea of the enriched GO terms for this subset of genes.
I tried to do this using tools which perform hypergeometric testing for enrichment, and I didn't get any significant results. Pvalues were significant but not after correction, and I think there are a number of type 1 errors due to the nature of the nonparametric testing employed.
So I read up and and implemented a parametric method whereby I randomly subsample 150 genes from the total panel list, determine the ratio of each GO term compared to all terms, bootstrap this process 1000x times, calculate the probability of the observed ratio given the distribution of the randomly bootstrapped ratios, then correct for multiple testing using BH. I've checked the resulting ratios and they exhibit a normal distribution, mostly. The only ones which don't are very rare GO terms (eg., 1 associated gene) and can be easily accounted either by -log normalization or by removal from subsequent analyses.
Results look pretty good, and show enrichment for terms we are expecting, so I think it's a legitimate method. I've also tried increasing the number of bootstraps, and it doesn't have a significant impact on the mean or standard deviation for the ratio distributions, so I'm not just generating swaths of data to artificially create a significant enrichment. I've seen this method implemented elsewhere, so there must be some sort of validity to it. I was even thinking because it's a parametric method, it's likely more powerful than the hypergeometric method? I'm curious if anyone has any comments, concerns, or words of caution about the approach.
Also, given this list of significantly enriched GO terms, we'd like to generate a network diagram to help visualize the result. However, for some reason, my boss is dead set on using a specific piece of software (i.e., ClueGO) even though it won't work for this particular application. So in turn, I have to find a way to hack a solution. I've tried REVIGO, which actually worked very well, however the interconnectedness between nodes was a bit scarce, as it's based upon semantic similarity determined by SimRel. Ideally, I'd like to generate something similar to: https://www.researchgate.net/profile/Annika_Raupach/publication/265094954/figure/fig5/AS:288698752745472@1445842554015/Figure-5-ClueGOcytoscape-analysis-of-AKT-regulated-phosphoproteins-AKT2-speci-fi-c-A.png . I've been able to generate some decent networks with nodes and edges, so I can get the groundwork going, but I'm curious if there is a resource I'm missing that will specifically take GO terms as input and meaningfully generate edges between terms clustered together into specific functional groupings.