When doing a hypergeometric test for pathway enrichment, is there a generally accepted way of defining the total "gene universe"? I am debating two possible numbers: 1) the number of probes on the microarray that was used to generate the data in the first place, and 2) the total number of genes in the model organism used. Any thoughts on the most appropriate approach?
In enrichment analysis, choosing the right background gene set is critical. Differences in the background directly affect the statistical significance (P-values) and ultimately the biological inference.
If you use all genes in the genome, you will get inflated (overly significant) P-values. If you instead use only the genes that are annotated in your pathway categories, the results will be more robust and reliable.
So if you are using a microarray for your analysis, use only the genes represented on the chip as your background. Using all genes in the whole genome as the reference background is not recommended: genes that were never measured could not have appeared in your hit list, and counting them makes the P-values artificially more significant.
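To make the effect of the background concrete, here is a minimal sketch in plain Python that computes the upper-tail hypergeometric P-value for the same hit list against two universes. All of the numbers (a 10,000-gene chip, a 20,000-gene genome, a 100-gene pathway fully covered by the chip, 200 differentially expressed genes, 6 of them in the pathway) are made up for illustration, and `enrichment_pvalue` is a hypothetical helper, not part of any package.

```python
from math import comb


def enrichment_pvalue(universe_size, pathway_size, selected_size, overlap):
    """Upper-tail hypergeometric P-value, P(X >= overlap).

    universe_size: number of genes in the background
    pathway_size:  background genes annotated to the pathway
    selected_size: genes in the hit list (e.g. differentially expressed)
    overlap:       hit-list genes that are also in the pathway
    """
    total = comb(universe_size, selected_size)
    upper = min(pathway_size, selected_size)
    tail = sum(
        comb(pathway_size, i)
        * comb(universe_size - pathway_size, selected_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers, chosen only to illustrate the point.
# Assumes every pathway gene is present on the chip.
pathway_size, selected_size, overlap = 100, 200, 6

p_chip = enrichment_pvalue(10_000, pathway_size, selected_size, overlap)
p_genome = enrichment_pvalue(20_000, pathway_size, selected_size, overlap)

print(f"chip background:   P = {p_chip:.3g}")
print(f"genome background: P = {p_genome:.3g}")
```

With the whole genome as the universe, the expected overlap by chance drops from 2 genes to 1, so the same observed overlap of 6 looks more surprising and the P-value comes out smaller, even though nothing about the experiment changed.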