Choosing A Baseline Set Of Genes For GO Enrichment Analysis
11.2 years ago
Abhi ★ 1.6k

Hi guys,

I have a list of thousands of mouse genes that I would like to test for GO term enrichment.

Would it be OK to use the whole set of mouse genes as the baseline when testing for GO term enrichment? Or should I randomly subsample the same number of genes as in the test list?

Thanks! -Abhi

expression gene-expression enrichment • 4.2k views
11.2 years ago
Michele Busby ★ 2.2k

I think the answer depends on how you came to get your list of 1000 genes. For example, in a microarray experiment you would have a chip with, say, 30 thousand genes on it, and maybe 1000 of those genes would be called differentially expressed. Your background list should contain only those 30K genes and not, for example, 50 genes that are absent from the chip and so were never measured. For RNA-Seq it is harder because there are inherent biases in how each gene is measured (described to some extent in the GOseq paper), and some people use complicated statistical tests on top of that, which can add more bias.
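To make the microarray case concrete, here is a minimal sketch of a one-sided hypergeometric enrichment test in which the background is restricted to the genes on the chip. The gene IDs, the function names (`hypergeom_upper_tail`, `enrichment_pvalue`) and the toy numbers are all made up for illustration; real tools may use a different test or apply corrections that are not shown here.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def enrichment_pvalue(study_genes, background_genes, term_genes):
    """Enrichment p-value for one GO term, counting only measured genes."""
    background = set(background_genes)
    # Genes that were never measured are dropped from both lists.
    study = set(study_genes) & background
    annotated = set(term_genes) & background
    k = len(study & annotated)
    return hypergeom_upper_tail(k, len(background), len(annotated), len(study))

# Toy data (hypothetical IDs): 20 genes on the chip, 5 called differentially expressed.
chip_genes = [f"g{i}" for i in range(20)]
de_genes = ["g0", "g1", "g2", "g3", "g4"]
# The term annotates 4 chip genes plus 2 genes that are not on the chip.
term_genes = ["g0", "g1", "g2", "g10", "offchip1", "offchip2"]

p = enrichment_pvalue(de_genes, chip_genes, term_genes)
print(f"p = {p:.4f}")
```

The two off-chip genes do not count towards the term size here, which is the point of using the chip as the background.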

If there are no significant biases in how you obtained your 1000 genes, then I think using the entire set as the background would give you the clearest signal, because it contains the most information. I could imagine findings that are significant against the whole set not being significant against random subsets, simply for lack of statistical power with smaller numbers. Some of the categories have very few genes and high statistical noise.
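The power argument can be illustrated with made-up numbers. In the sketch below, 30 of 1000 study genes carry a term that annotates 1% of a 20,000-gene background; the same study list is then compared against a random group of 1000 genes in which 10 carry the term. All counts and the helper name `upper_tail` are assumptions chosen for illustration, and both comparisons use the same one-sided hypergeometric (Fisher-type) tail.

```python
from math import comb

def upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)) / total

# Whole background: 20,000 genes, 200 annotated; the study list has 30 hits in 1000 genes.
p_full = upper_tail(30, 20000, 200, 1000)

# Random comparison group of the same size as the study list: 10 hits in 1000 genes.
# Pooled with the study list, that is 2000 genes with 40 annotated, 30 of them in the study list.
p_subsample = upper_tail(30, 2000, 40, 1000)

print(f"whole background: p = {p_full:.2e}")
print(f"random subsample: p = {p_subsample:.2e}")
```

With these numbers the whole-background comparison gives the smaller p-value, even though the study list and the term's frequency are the same in both cases.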

If there are biases in how you obtained your 1000 genes, then you need to account for them, which could be as simple as using an appropriate background gene list or could be a huge bioinformatics headache, depending on the experiment.

That's my opinion, anyway!

Good luck! Michele

Thanks Michele. The 1000-gene list this time does indeed come from a microarray chip. Just so I get it right: my baseline in this case should contain only the genes that the chip was designed to measure, right?

Yes. That's how I was taught to do it.

11.2 years ago

My gene lists are always associated with a value, either a p-value, an enrichment score, or some other factor. So I use this value to take the top, let's say, 500, 1000 or 3000 genes, depending on the number of items in the list.

But even if you take a very large list, it shouldn't matter, as most tools, like DAVID, display the top enriched categories and also show how many genes make up each category in the user's list and in the database.

The best thing would be to sample two or three times and test; if you get varying results, it is better to set a cut-off and run the GO analysis on that list.

Thanks Sukhdeep for another perspective
