I'm trying to run a GO enrichment analysis in R. I'm using the gage package, and the GO terms are downloaded from Ensembl using the biomaRt package. My problem is that I'm getting too many enriched categories, and they're quite redundant. This is after applying an FDR cutoff of 0.05 and testing only GO categories with 10-50 genes, in order to avoid categories that are too esoteric or too general.
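For reference, the filtering described above amounts to something like the following base R sketch. The gene sets, GO IDs and p-values here are made up for illustration, and Benjamini-Hochberg is assumed as the FDR method:

```r
# Hypothetical gene sets: GO ID -> vector of gene IDs.
gene_sets <- list(
  "GO:0000001" = paste0("gene", 1:5),    # too small, dropped
  "GO:0000002" = paste0("gene", 1:20),   # within 10-50 genes, kept
  "GO:0000003" = paste0("gene", 1:200)   # too large, dropped
)

# Keep only categories with 10-50 genes.
sizes <- lengths(gene_sets)
kept_sets <- gene_sets[sizes >= 10 & sizes <= 50]

# Hypothetical raw p-values, one per tested category.
raw_p <- c("GO:0000002" = 0.01, "GO:0000010" = 0.04, "GO:0000011" = 0.30)

# Benjamini-Hochberg adjustment, then the 0.05 cutoff.
fdr <- p.adjust(raw_p, method = "BH")
enriched <- names(fdr)[fdr <= 0.05]
```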
I came across two solutions to this issue:
1. Cluster the GO terms using pairwise distances between them, which can be obtained with packages such as GOSim, using the function getTermSim. However, if I get a few hundred enriched terms that I'd like to cluster in order to remove redundancy, getTermSim takes a very long time and is therefore impractical.
2. Use GO slim terms. For that I use the GSEABase package, download GO slim files from geneontology.org, and use them to trim the GO terms downloaded using biomaRt. The problem here is that, at least for human data, which is what I'm analyzing, the GO slim terms seem a bit sparse to me.
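To make the clustering option concrete: once a term-by-term similarity matrix exists, the clustering itself is cheap, so the bottleneck is only the similarity computation. A minimal sketch in base R, assuming a symmetric similarity matrix with values in [0, 1] of the kind getTermSim is meant to produce. The GO IDs, similarity values, p-values and the 0.5 cut height are all hypothetical:

```r
# Toy similarity matrix for five enriched terms (hypothetical values).
terms <- c("GO:0000001", "GO:0000002", "GO:0000003", "GO:0000004", "GO:0000005")
sim <- matrix(c(
  1.0, 0.9, 0.8, 0.1, 0.2,
  0.9, 1.0, 0.7, 0.2, 0.1,
  0.8, 0.7, 1.0, 0.1, 0.1,
  0.1, 0.2, 0.1, 1.0, 0.9,
  0.2, 0.1, 0.1, 0.9, 1.0
), nrow = 5, byrow = TRUE, dimnames = list(terms, terms))

# Hypothetical enrichment p-values for the same terms.
pvals <- c(0.001, 0.01, 0.02, 0.03, 0.004)
names(pvals) <- terms

# Convert similarity to distance and cluster hierarchically.
hc <- hclust(as.dist(1 - sim), method = "average")

# Cut the tree at distance 0.5 to obtain groups of similar terms.
clusters <- cutree(hc, h = 0.5)

# Keep the most significant term of each cluster as its representative.
representatives <- vapply(
  split(terms, clusters),
  function(ids) ids[which.min(pvals[ids])],
  character(1)
)
```

The cut height controls how aggressively terms are merged; a lower value keeps more, and more similar, representatives.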
So my question is whether there's a solution to this, some happy medium?
Is there a precomputed file of all pairwise GO term distances that can be downloaded? That would save calling getTermSim each time I run the script.
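Failing a downloadable file, one workaround would be to compute the matrix once and cache it on disk. A minimal sketch in base R, where cached_similarity, compute_fun and the cache path are hypothetical names, and compute_fun stands in for whatever slow call produces the matrix (it is assumed to return a square matrix with GO IDs as row and column names):

```r
# Return a cached similarity matrix if it exists, otherwise compute and store it.
# `compute_fun` is a placeholder for the slow similarity computation.
cached_similarity <- function(term_ids, compute_fun, cache_file) {
  if (file.exists(cache_file)) {
    cached <- readRDS(cache_file)
    # Reuse the cache only if it covers every requested term.
    if (all(term_ids %in% rownames(cached))) {
      return(cached[term_ids, term_ids, drop = FALSE])
    }
  }
  sim <- compute_fun(term_ids)
  saveRDS(sim, cache_file)
  sim
}

# Usage with a dummy stand-in for the real similarity function.
dummy_sim <- function(ids) {
  m <- diag(length(ids))
  dimnames(m) <- list(ids, ids)
  m
}
cache_file <- file.path(tempdir(), "go_term_similarity.rds")
ids <- c("GO:0000001", "GO:0000002")
sim1 <- cached_similarity(ids, dummy_sim, cache_file)  # computed and saved
sim2 <- cached_similarity(ids, dummy_sim, cache_file)  # read back from disk
```

Note that this sketch recomputes the whole matrix whenever a requested term is missing from the cache; computing only the missing pairs would be a natural refinement.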