Hi,

I have a sequencing panel of approx. 1100 genes for 100 or so patients. We found that approx. 150 of these genes exhibit mutational frequencies greater than 5% in our population, and we want to get an idea of the enriched GO terms for this subset of genes.

I tried to do this using tools which perform hypergeometric testing for enrichment, and I didn't get any significant results. Pvalues were significant but not after correction, and I think there are a number of type 1 errors due to the nature of the nonparametric testing employed.

So I read up and and implemented a parametric method whereby I randomly subsample 150 genes from the total panel list, determine the ratio of each GO term compared to all terms, bootstrap this process 1000x times, calculate the probability of the observed ratio given the distribution of the randomly bootstrapped ratios, then correct for multiple testing using BH. I've checked the resulting ratios and they exhibit a normal distribution, mostly. The only ones which don't are very rare GO terms (eg., 1 associated gene) and can be easily accounted either by -log normalization or by removal from subsequent analyses.

Results look pretty good, and show enrichment for terms we are expecting, so I think it's a legitimate method. I've also tried increasing the number of bootstraps, and it doesn't have a significant impact on the mean or standard deviation for the ratio distributions, so I'm not just generating swaths of data to artificially create a significant enrichment. I've seen this method implemented elsewhere, so there must be some sort of validity to it. I was even thinking because it's a parametric method, it's likely more powerful than the hypergeometric method? I'm curious if anyone has any comments, concerns, or words of caution about the approach.

Also, given this list of significantly enriched GO terms, we'd like to generate a network diagram to help visualize the result. However, for some reason, my boss is dead set on using a specific piece of software (i.e., ClueGO) even though it won't work for this particular application. So in turn, I have to find a way to hack a solution. I've tried REVIGO, which actually worked very well, however the interconnectedness between nodes was a bit scarce, as it's based upon semantic similarity determined by SimRel. Ideally, I'd like to generate something similar to: https://www.researchgate.net/profile/Annika_Raupach/publication/265094954/figure/fig5/AS:288698752745472@1445842554015/Figure-5-ClueGOcytoscape-analysis-of-AKT-regulated-phosphoproteins-AKT2-speci-fi-c-A.png . I've been able to generate some decent networks with nodes and edges, so I can get the groundwork going, but I'm curious if there is a resource I'm missing that will specifically take GO terms as input and meaningfully generate edges between terms clustered together into specific functional groupings.

Thanks, Martin

The one in the figure is manually done with cytoscape plugins , I reckon. You can use cytoscape and play around to do the same. When you say nodes and edges you need to put weights to your graph based on your adjacency matrix being generated from your nodes. Only then you will be able to derive such kind of graphs. I believe if this 150 genes now have some kind of enrichment with enrichment value and p-value, you can always use them in the cytoscape to build up your desired network. Or you can use REVIGO and model the rscript scode which is generate online to exploit it your benefit. However it uses semantic similarity.

Yea, that seems like the most reasonable option. REVIGO has been instrumental in this regard, however I had been having some issues generating more applicable edges, which has been addressed in another comment / answer. Thanks for your input, it's much appreciated.

Make sure you control for gene size when you run your subsampling

Thanks for your response. Would you be able to clarify a bit? At this point the 5% and greater list isn't corrected for size, or atleast not known to be significantly mutated compared to sequencing error or natural mutational frequency. We did do an analysis with mutsig which does correct for this, however we only have a hundred or so samples, so the power for the analysis is as I understand it relatively low and we only reported 4-5 significantly mutated genes e.g., typical drivers like KRAS, TP53, etc.