My first post. I struggled with this and I hope my post can help others. The LLM/ChatGPT responses I got were incorrect, so I manually checked the published articles and the clusterProfiler code to be sure. Please correct me if I'm mistaken.
N is the total number of genes in the background distribution/universe. M is the number of genes within that distribution that are annotated (either directly or indirectly) to the gene set of interest. BgRatio = M/N.
For example, the hallmark collection for mouse has 50 gene sets. The Hallmark_angiogenesis gene set has 36 genes, and the whole hallmark collection (accessed in 2025) has 4393 unique genes. So for Hallmark_angiogenesis, BgRatio = M/N = 36/4393.
The next two terms, n and k, depend on your own experimental data, e.g. the result of a DEG analysis (differentially expressed genes between two experimental conditions).
n is the size of the list of genes of interest, and k is the number of genes within that list which are annotated to a given gene set. GeneRatio = k/n. (As far as I can tell from the code, input genes with no annotation anywhere in the collection are dropped before counting, so n can be smaller than the list you supplied.) For example, in my data I have 1352 DEGs, and 5 of these are in the Hallmark_angiogenesis gene set. So k = 5, n = 1352, and GeneRatio = k/n = 0.0037. (Note: for the Hallmark_angiogenesis gene set, k cannot exceed 36, because only 36 genes are annotated in that gene set, i.e. k is never bigger than M.)
There are two other terms. The richFactor is defined as the ratio of input genes (e.g. DEGs) that are annotated in a term to all genes that are annotated in this term: richFactor = k/M.
For my angiogenesis example, richFactor = k/M = 5/36.
The fold enrichment is defined as the ratio of the frequency of input genes annotated in a term to the frequency of all genes annotated to that term. Mathematically, fold enrichment = GeneRatio/BgRatio = (k/n)/(M/N).
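To make the arithmetic concrete, here is a small R sketch that plugs the angiogenesis numbers from above into the four definitions. The variable names are my own and are not part of clusterProfiler.

```r
# Numbers from the Hallmark_angiogenesis example above
k <- 5      # DEGs annotated to the gene set
n <- 1352   # DEGs in the input list
M <- 36     # genes annotated to the gene set
N <- 4393   # unique genes in the whole collection (universe)

gene_ratio      <- k / n                   # GeneRatio
bg_ratio        <- M / N                   # BgRatio
rich_factor     <- k / M                   # richFactor
fold_enrichment <- gene_ratio / bg_ratio   # same as (k * N) / (n * M)

print(round(c(GeneRatio = gene_ratio, BgRatio = bg_ratio,
              richFactor = rich_factor, FoldEnrichment = fold_enrichment), 4))
```

With these numbers the fold enrichment comes out below 1, meaning the gene set is less frequent in the input list than in the background.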
(For richFactor, there is a line of code in the Wu et al. paper:
y <- mutate(x, richFactor = Count / as.numeric(sub("/\\d+", "", BgRatio)))
Please do not read this as richFactor = Count/BgRatio. It is not. BgRatio is a string such as "36/4393". The sub() call deletes the "/" and the digits after it, i.e. "/4393", so "36" is left behind, and richFactor = 5/36. Again, richFactor = k/M. Also, in Figure 5A of the paper, richFactor ranges from roughly 0 to 0.15, which is the range you would expect from k/M in most cases.)
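If you want to do the same thing without dplyr, here is a base R sketch on a made-up one-row result table. The column names (ID, GeneRatio, BgRatio, Count) follow what clusterProfiler puts in its result data frame; the parse_ratio helper is my own, hypothetical, and not a clusterProfiler function.

```r
# Toy result table; values taken from the angiogenesis example
res <- data.frame(
  ID        = "Hallmark_angiogenesis",
  GeneRatio = "5/1352",
  BgRatio   = "36/4393",
  Count     = 5,
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn an "a/b" string into the number a / b
parse_ratio <- function(x) {
  parts <- strsplit(x, "/", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]), numeric(1))
}

# M is the part of BgRatio before the slash ("36/4393" -> "36")
M <- as.numeric(sub("/\\d+", "", res$BgRatio))

res$richFactor     <- res$Count / M
res$FoldEnrichment <- parse_ratio(res$GeneRatio) / parse_ratio(res$BgRatio)
print(res)
```

Note the doubled backslash in "/\\d+": inside an R string literal a single "\d" is not a valid escape.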
References:
https://yulab-smu.top/biomedical-knowledge-mining-book/enrichment-overview.html
Wu et al. 2021. "clusterProfiler 4.0: A universal enrichment tool for interpreting omics data." DOI: 10.1016/j.xinn.2021.100141
Boyle, Elizabeth I, Shuai Weng, Jeremy Gollub, Heng Jin, David Botstein, J Michael Cherry, and Gavin Sherlock. 2004. "GO::TermFinder–open Source Software for Accessing Gene Ontology Information and Finding Significantly Enriched Gene Ontology Terms Associated with a List of Genes." Bioinformatics (Oxford, England) 20 (18): 3710–15. https://doi.org/10.1093/bioinformatics/bth456.
I'm a little confused about these terms.
When I've used the same gene list, why do my values of n and N change when doing gene ontology enrichment for different categories?
For example, for the same gene list in an over-representation test, in Biological Process the term taxis has GeneRatio 209/3770 and BgRatio 440/12553, but in Cellular Component the term extracellular matrix has GeneRatio 162/3963 and BgRatio 339/13183. Shouldn't the n and N values stay the same across GO categories?
Cheers
Yeah I have the same problem. I don't really understand why the small n is changing then?
I am also struggling with the same problem (i.e. n and N are changing). Have you figured it out?
GeneRatio = k/n
k is the overlap between your genes-of-interest and the geneset
n is the number of all unique genes-of-interest
BgRatio = M/N
M is the number of genes within each geneset
N is the number of all unique genes across all genesets (universe)
The link is broken, but the content was archived by the Wayback Machine: https://web.archive.org/web/20171111072829/https://bioconductor.org/packages/release/bioc/vignettes/DOSE/inst/doc/enrichmentAnalysis.html#over-representation-analysis
Or better yet, the same info at the clusterProfiler book: http://yulab-smu.top/clusterProfiler-book/chapter2.html#over-representation-analysis
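On the original question of why n and N change between Biological Process and Cellular Component: as far as I can tell from the DOSE source that clusterProfiler relies on, input genes with no annotation in the collection being tested are dropped before n is counted, and N is the number of annotated genes in that collection. So both depend on the ontology. Here is a toy sketch in base R, with made-up gene IDs and two made-up collections. The ratios function is hypothetical and is not clusterProfiler's actual code.

```r
# Made-up input list and two made-up collections (term -> genes)
genes_of_interest <- c("g1", "g2", "g3", "g4", "g5", "g6")

collection_a <- list(termA1 = c("g1", "g2", "g7"),
                     termA2 = c("g3", "g8", "g9", "g10"))
collection_b <- list(termB1 = c("g1", "g5", "g6", "g11"),
                     termB2 = c("g2", "g12"))

# Hypothetical helper: GeneRatio and BgRatio for every term in one collection
ratios <- function(genes, collection) {
  universe  <- unique(unlist(collection))          # annotated genes only
  annotated <- intersect(unique(genes), universe)  # unannotated input is dropped
  n <- length(annotated)
  N <- length(universe)
  k <- vapply(collection, function(s) length(intersect(annotated, s)), integer(1))
  M <- lengths(collection)
  data.frame(term      = names(collection),
             GeneRatio = paste0(k, "/", n),
             BgRatio   = paste0(M, "/", N),
             stringsAsFactors = FALSE, row.names = NULL)
}

res_a <- ratios(genes_of_interest, collection_a)
res_b <- ratios(genes_of_interest, collection_b)
print(res_a)
print(res_b)
```

The input list is the same in both calls, but n is 3 for the first collection and 4 for the second, and N is 7 versus 6, because each collection annotates a different subset of genes.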