"In the case of ORA for differential expression (eg: RNA-seq), a whole
genome background is inappropriate because in any tissue, most genes
are not expressed and therefore have no chance of being classified as
DEGs. A good rule of thumb is to use a background gene list consisting
of genes detected in the assay at a level where they have a chance of
being classified as DEG"
This depends. I frequently use a whole-genome background, and it is appropriate in many (albeit not all) cases. If you're doing a differential expression study with RNA-seq (or another whole-genome assay) and are looking at one tissue type differentiating into another, I'd argue for a whole-genome background. In this case, all genes have the capacity to be detected, and if my, say, embryonic tissue is differentiating into, say, kidney tissue, I'd want my pathway results to show that. If my background were just the union of genes with TPM >= 1 in any sample, I might not see this effect, because that background would contain the genes expressed in either tissue type. This is especially true in tumorigenesis or cancer therapy studies, where almost anything can happen.
On the other hand, if you're working on liver tissue and making a small perturbation that affects the expression of a few liver enzymes, then yes, you'd want a "liver tissue" background. Otherwise all your enriched pathways are going to be liver pathways, since your "list of genes that could potentially change" is completely biased towards liver-specific genes, whereas what you're actually interested in is which metabolic pathways those enzymes belong to. With a whole-genome background, your results aren't wrong per se (indeed, liver pathways are enriched when you perturb the expression of liver genes, duh!); they're just not what you're looking for.
It's very situation-dependent and the choice of background is a lot trickier than one might think: it involves thinking carefully about your biological question, what exactly you're looking for, what you want your null model to be, how you want to interpret your results, etc.
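To make the liver example concrete, here is a toy sketch of how the same overlap scores under the two backgrounds. It uses a plain one-sided hypergeometric test, which is the usual ORA null but not necessarily what your tool of choice implements. All counts are invented for illustration, and the function name `ora_pvalue` is my own.

```python
from math import comb


def ora_pvalue(n_background, n_pathway, n_degs, n_overlap):
    """One-sided hypergeometric p-value, P(X >= n_overlap).

    n_background: genes in the background (the "universe")
    n_pathway:    pathway genes that are in the background
    n_degs:       DEGs that are in the background
    n_overlap:    DEGs that are also pathway genes
    """
    upper = min(n_pathway, n_degs)
    total = comb(n_background, n_degs)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 100 DEGs, 10 of which fall in one liver pathway.
# Whole-genome background: 20000 genes, 200 of them in the pathway.
p_whole = ora_pvalue(20000, 200, 100, 10)

# "Detected in liver" background: 8000 genes, 190 of them in the pathway
# (assumes most pathway genes are expressed in liver).
p_detected = ora_pvalue(8000, 190, 100, 10)

print(f"whole genome background:   p = {p_whole:.3g}")
print(f"detected-genes background: p = {p_detected:.3g}")
```

With these made-up counts the pathway looks far more significant against the whole genome than against the detected-genes background, simply because the expected overlap by chance is smaller in the larger universe. Which of the two p-values answers your question is the judgment call described above.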
Looks great. I've been banging on about this for years!
Just adding this discussion from Gordon Smyth over at Bioconductor from this week (GSEA-related).
Scary... It's hard to believe that (up to) 95% of the researchers using an enrichment method were ignorant of the best practice regarding background correction... this looks somewhat like an "I used the tool/parameters which gave me results" issue.
Thanks for sharing -- it was a good read.