Question: GO enrichment analysis using R
rubic (United States) wrote, 4.3 years ago:

Hi,

I'm trying to run a GO enrichment analysis in R. I'm using the gage package, and the GO terms are downloaded from Ensembl using the biomaRt package. My problem is that I get too many enriched categories, and they are highly redundant. This is after applying an FDR-adjusted p-value cutoff of 0.05 and testing only GO categories with 10-50 genes, to avoid categories that are either too esoteric or too general.

I came across two solutions to this issue:

  1. GO terms can be clustered using the pairwise distances between them, which can be obtained with packages such as GOSim (function getTermSim). However, when I have a few hundred enriched terms to cluster in order to remove redundancy, getTermSim takes far too long to be practical.

  2. Use GO slim terms. For that I use the GSEABase package with GO slim files downloaded from geneontology.org to trim the GO terms obtained through biomaRt. The problem here is that, at least for human data (which is what I'm analyzing), the GO slim terms seem rather limited to me.

So my question is whether there is a solution to this, some happy medium?

Is there a precomputed file of all pairwise GO term distances that can be downloaded? That would save calling getTermSim each time I run the script.
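
To make the clustering idea from option 1 concrete, here is a minimal base-R sketch. It assumes a pairwise similarity matrix is already available; the term names, similarity values, and adjusted p-values below are invented placeholders, and the 0.5 cut height is an arbitrary choice.

```r
# Placeholder term names; real GO IDs would go here.
terms <- c("term_A", "term_B", "term_C", "term_D", "term_E")

# Invented pairwise similarities (1 = identical). In practice these would
# come from a semantic similarity package.
sim <- matrix(c(
  1.0, 0.9, 0.8, 0.1, 0.2,
  0.9, 1.0, 0.7, 0.2, 0.1,
  0.8, 0.7, 1.0, 0.1, 0.1,
  0.1, 0.2, 0.1, 1.0, 0.9,
  0.2, 0.1, 0.1, 0.9, 1.0
), nrow = 5, byrow = TRUE, dimnames = list(terms, terms))

# Invented adjusted p-values from the enrichment test.
padj <- c(term_A = 0.010, term_B = 0.001, term_C = 0.020,
          term_D = 0.030, term_E = 0.004)

# Turn similarity into distance and cluster hierarchically.
hc <- hclust(as.dist(1 - sim), method = "average")

# Cut the tree at an arbitrary height; terms merged below it share a cluster.
cluster <- cutree(hc, h = 0.5)

# Keep the most significant term of each cluster as its representative.
representatives <- unlist(
  lapply(split(terms, cluster), function(ids) ids[which.min(padj[ids])]),
  use.names = FALSE
)
print(representatives)
```

Since the similarity matrix is the slow part, it could be computed once, stored with saveRDS(), and reloaded with readRDS() in later runs.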


I usually find that topGO does a good job of reducing the excessive redundancy of GO terms. It also often reports medium-sized categories as the most significant ones.

Comment written 4.3 years ago by Martombo
Guangchuang Yu (China/Guangzhou/Southern Medical University) wrote, 4.3 years ago:

Maybe you can try clusterProfiler, which can do GO enrichment analysis using either a hypergeometric test or GSEA.

It can simplify the result by removing highly similar terms, with the similarity calculated by GOSemSim.
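
A rough sketch of what that could look like, assuming the Bioconductor packages clusterProfiler, org.Hs.eg.db, and DOSE are installed. The example gene list (`geneList` from DOSE), the fold-change threshold, and the cutoff values are illustrative assumptions and do not come from this thread.

```r
library(clusterProfiler)
library(org.Hs.eg.db)

# Example data shipped with DOSE: a named vector of fold changes keyed by
# Entrez gene ID. This stands in for your own data.
data(geneList, package = "DOSE")
genes <- names(geneList)[abs(geneList) > 2]

# Hypergeometric (over-representation) test on Biological Process terms.
# minGSSize / maxGSSize can be tightened to restrict category size.
ego <- enrichGO(
  gene          = genes,
  universe      = names(geneList),
  OrgDb         = org.Hs.eg.db,
  keyType       = "ENTREZID",
  ont           = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff  = 0.05,
  minGSSize     = 10,
  maxGSSize     = 500
)

# Among terms whose semantic similarity exceeds the cutoff, keep only the
# one with the smallest adjusted p-value.
ego_simple <- simplify(ego, cutoff = 0.7, by = "p.adjust", select_fun = min)

n_before <- nrow(as.data.frame(ego))
n_after  <- nrow(as.data.frame(ego_simple))
print(c(before = n_before, after = n_after))
```

Lowering `cutoff` removes more terms; 0.7 is only a starting point.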


But like GOSim, clusterProfiler generates a pairwise semantic distance matrix, which takes a very long time.

Comment written 4.3 years ago by rubic

It should produce output in a reasonable amount of time.

Comment written 4.3 years ago by Guangchuang Yu
Carlo Yague (Canada) wrote, 4.3 years ago:

"My problem is that I'm getting too many enriched categories and they're pretty redundant."

A third solution could be to filter the enriched GO categories based on:

  • p-value (be more stringent)
  • number of genes in the category (very big groups are often not very informative; yes, I'm talking to you, "cellular process")
  • minimum number of enriched genes in the category (sometimes a category is found significant with just one enriched gene, especially if the category is very small)
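
As a rough illustration, the three filters above could be applied to a results table like this. The data frame, its column names, and the thresholds are all hypothetical; adapt them to whatever your enrichment tool returns.

```r
# Hypothetical enrichment results (all values invented for illustration).
res <- data.frame(
  term     = c("term_A", "term_B", "term_C", "term_D"),
  padj     = c(0.001, 0.04, 0.0005, 0.002),
  set_size = c(25, 30, 1500, 12),  # genes annotated to the category
  n_hits   = c(8, 6, 40, 1),       # input genes found in the category
  stringsAsFactors = FALSE
)

# Arbitrary starting thresholds, not recommendations.
max_padj     <- 0.01  # more stringent p-value cutoff
max_set_size <- 500   # drop very large, uninformative categories
min_hits     <- 3     # require several enriched genes per category

filtered <- subset(
  res,
  padj <= max_padj & set_size <= max_set_size & n_hits >= min_hits
)
print(filtered)
```

Here only term_A survives: term_B fails the p-value cutoff, term_C is too large, and term_D is supported by a single gene.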

Thanks for the response. I'm actually already applying these filters - just updated that in my post.

Comment written 4.3 years ago by rubic