Question

clusterProfiler: What are GeneRatio and BgRatio?

15

8.6 years ago

ZheFrench ▴ 590

Question is in the title.

GeneRatio is like M/N, where M is the number of genes from your input list that match the GO term. But I don't see what N is.

BgRatio is like A/B, where B is all genes in the database, but I'm not sure what A corresponds to. Is it the number of genes in the database that are annotated to this GO term?

Tell me if I'm wrong. Thanks.

clusterProfiler • 57k views

ADD COMMENT • link updated 25 days ago by chong.weelic • 0 • written 8.6 years ago by ZheFrench ▴ 590

score 29 · Answer 1 · 2018-03-01

I will give an example that helped me understand this. I was also looking for the answer, and Guangchuang's link helped.

Let's suppose I have a collection of gene sets called HALLMARK, and within it a specific gene set called E2F_targets.

BgRatio is M/N.

M = size of the gene set (e.g. the size of E2F_targets), i.e. the number of genes within the background distribution that are annotated (either directly or indirectly) to the node of interest.

N = number of unique genes in the whole collection of gene sets (e.g. the HALLMARK collection), i.e. the total number of genes in the background distribution (universe).

GeneRatio is k/n.

k = size of the overlap between the 'vector of gene id' you input and the specific gene set (e.g. E2F_targets), counting unique genes only, i.e. the number of genes within the list of size n that are annotated to the node.

n = size of the overlap between the 'vector of gene id' you input and all members of the collection of gene sets (e.g. the HALLMARK collection), counting unique genes only, i.e. the size of the list of genes of interest.
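As a minimal sketch of these four counts in base R (no clusterProfiler needed): the gene IDs, the two collections, and the helper `ratio_counts` below are all hypothetical. It assumes, as described above, that input genes absent from the collection are dropped before n is counted and that the universe is every unique gene in the collection.

```r
# Hypothetical toy data: one input list, two gene set collections.
genes <- c("g1", "g2", "g3", "g4", "g5", "g99")   # g5 and g99 are in neither collection

collection_a <- list(setA1 = c("g1", "g2", "g3", "g6"),
                     setA2 = c("g4", "g7", "g8"))
collection_b <- list(setB1 = c("g1", "g2", "g9"),
                     setB2 = c("g9", "g10"))

# Count k, n, M, N for one gene set within one collection (hypothetical helper).
ratio_counts <- function(genes, collection, set_name) {
  universe  <- unique(unlist(collection))          # all unique genes in the collection
  annotated <- intersect(unique(genes), universe)  # input genes found in the collection
  gene_set  <- unique(collection[[set_name]])
  c(k = length(intersect(annotated, gene_set)),    # input genes in this gene set
    n = length(annotated),                         # input genes in the collection
    M = length(gene_set),                          # size of this gene set
    N = length(universe))                          # size of the universe
}

counts_a <- ratio_counts(genes, collection_a, "setA1")
counts_b <- ratio_counts(genes, collection_b, "setB1")
print(counts_a)  # GeneRatio = k/n, BgRatio = M/N
print(counts_b)
```

Under that assumption the same input list gives n = 4 and N = 7 against the first collection but n = 2 and N = 4 against the second, because both counts depend on which genes the collection annotates.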

score 4 · Answer 2 · 2016-11-06

8.6 years ago

Guangchuang Yu ★ 2.6k

see https://bioconductor.org/packages/release/bioc/vignettes/DOSE/inst/doc/enrichmentAnalysis.html#over-representation-analysis

Corresponding to the formula, geneRatio is k/n.

ADD COMMENT • link 8.6 years ago by Guangchuang Yu ★ 2.6k

1

I'm a little confused about these terms.

When I've used the same gene list, why do my values of n and N change when doing gene ontology for different categories?

For example, for the same gene list in an over-representation test, in Biological Process the term taxis has GeneRatio 209/3770 and BgRatio 440/12553, but in Cellular Component the term extracellular matrix has GeneRatio 162/3963 and BgRatio 339/13183. Shouldn't the n and N values stay the same in different GO categories?

Cheers

ADD REPLY • link 6.0 years ago by unawaz ▴ 60

0

Yeah I have the same problem. I don't really understand why the small n is changing then?

ADD REPLY • link 3.1 years ago by Arend • 0

0

I am also struggling with the same problem (i.e. n and N are changing). Have you figured it out?

ADD REPLY • link 3.1 years ago by yatzutzu • 0

0

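library(clusterProfiler)
# 15 genes of interest: "a" to "o"
genes <- letters[1:15]
# TERM2GENE table: genesetX holds a-j (10 genes), genesetY holds b-z (25 genes)
gs_df <- data.frame("gs_name"=c(rep("genesetX", 10), rep("genesetY", 25)),
                    "entrez_gene"=c(letters[1:10], letters[2:26]))
enricher(gene = genes, TERM2GENE = gs_df, minGSSize=1)@result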

               ID Description GeneRatio BgRatio      pvalue    p.adjust       qvalue                      geneID Count
genesetX genesetX    genesetX     10/15   10/26 0.000565352 0.001130704 0.0005951074         a/b/c/d/e/f/g/h/i/j    10
genesetY genesetY    genesetY     14/15   25/26 1.000000000 1.000000000 0.5263157895 b/c/d/e/f/g/h/i/j/k/l/m/n/o    14

GeneRatio = k/n

k is the overlap between your genes-of-interest and the geneset
n is the number of all unique genes-of-interest

BgRatio=M/N

M is the number of genes within each geneset
N is the number of all unique genes across all genesets (universe)
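The GeneRatio, BgRatio, and pvalue columns of the table above can be re-derived in base R as a rough check. This is only a sketch: it assumes the p-value is the upper-tail hypergeometric probability P(X >= k) from the over-representation formula in the linked vignette, it counts n as the genes-of-interest that appear in at least one gene set (all 15 here), and it does not reproduce the p.adjust or qvalue columns.

```r
genes <- letters[1:15]
genesets <- list(genesetX = letters[1:10], genesetY = letters[2:26])

universe <- unique(unlist(genesets))       # all unique genes across gene sets
n <- length(intersect(genes, universe))    # genes-of-interest found in the universe
N <- length(universe)

rows <- lapply(names(genesets), function(name) {
  M <- length(genesets[[name]])                     # size of this gene set
  k <- length(intersect(genes, genesets[[name]]))   # overlap with genes-of-interest
  data.frame(ID = name,
             GeneRatio = paste0(k, "/", n),
             BgRatio = paste0(M, "/", N),
             # upper-tail hypergeometric probability P(X >= k)
             pvalue = phyper(k - 1, M, N - M, n, lower.tail = FALSE))
})
result <- do.call(rbind, rows)
print(result)
```

If some genes-of-interest were missing from every gene set, this sketch would count a smaller n, which is one way the denominator of GeneRatio can differ between collections for the same input list.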

ADD REPLY • link 2.2 years ago by Rene ▴ 10

0

The link is broken, but the content was archived by the Wayback Machine: https://web.archive.org/web/20171111072829/https://bioconductor.org/packages/release/bioc/vignettes/DOSE/inst/doc/enrichmentAnalysis.html#over-representation-analysis

Or better yet, the same info at the clusterProfiler book: http://yulab-smu.top/clusterProfiler-book/chapter2.html#over-representation-analysis

ADD REPLY • link 4.1 years ago by JorgeVallejo ▴ 20

score 4 · Answer 3 · 2022-04-19

3.2 years ago

sarahhp ▴ 40

Or perhaps in simpler terms GeneRatio = genes of interest in the gene set / total genes of interest. Most often I use it on lists of differentially expressed genes and so GeneRatio is also the fraction of differentially expressed genes found in the gene set.

I have struggled to find the right words to explain this to others, so I hope this helps!

ADD COMMENT • link 3.2 years ago by sarahhp ▴ 40

0

what about BgRatio ?

ADD REPLY • link 16 months ago by Picasa ▴ 680

score 0 · Answer 4 · 2025-05-22

My first post. I struggled with this and I hope my post can help others. LLM/ChatGPT responses were incorrect. I manually checked the articles published and the code in ClusterProfiler to be sure. Please correct me if I'm mistaken.

N is the total number of genes in the background distribution/universe. M is the number of genes within that distribution that are annotated (either directly or indirectly) to the gene set of interest. BgRatio = M/N.

For example, in the hallmark collection for mouse, there are 50 gene sets. The Hallmark_angiogenesis gene set has 36 genes. In the hallmark collection (accessed in 2025), there are 4393 unique genes. So for hallmark_angiogenesis, BgRatio = M/N = 36/4393

The next two terms, n and k, depend on user input, i.e. on the specific experimental data, for example the result of a DEG analysis (genes differentially expressed between two experimental conditions).

n is the size of the list of genes of interest, and k is the number of genes within that list which are annotated to each gene set. geneRatio = k/n. For example, in my data, I have 1352 DEGs, and 5 of these are in the Hallmark_angiogenesis gene set. So k = 5, n = 1352, and geneRatio = k/n = 5/1352 ≈ 0.0037. (Note: for the Hallmark_angiogenesis gene set, k cannot exceed 36, because there are 36 genes annotated in that gene set, i.e. k is never bigger than M.)

There are two other terms. A richFactor is defined as the ratio of input genes (e.g., DEGs) that are annotated in a term to all genes that are annotated in this term: richFactor = k/M. For my angiogenesis example, richFactor = k/M = 5/36.

The fold enrichment is defined as the ratio of the frequency of input genes annotated in a term to the frequency of all genes annotated to that term, and mathematically, is Fold enrichment = GeneRatio/BgRatio.

(For richFactor, there is a line of code in the Wu et al. paper: y <- mutate(x, richFactor = Count / as.numeric(sub("/\\d+", "", BgRatio))). Please do not read this as richFactor = Count/BgRatio. It is not. BgRatio is a string such as "36/4393". The code deletes the "/" and everything after it, i.e. /4393, so 36 is left behind, and richFactor = 5/36. Again, richFactor = k/M. Also, examine Figure 5A: richFactor varies between 0 and 0.15 there, which is consistent with k/M.)
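The richFactor and fold enrichment definitions can be checked by hand in base R with the angiogenesis numbers from this post. This is a minimal sketch: the variable names and the `parse_ratio` helper are hypothetical and not part of clusterProfiler.

```r
# Values from the Hallmark_angiogenesis example, written as ratio strings.
gene_ratio <- "5/1352"    # k/n
bg_ratio   <- "36/4393"   # M/N
count      <- 5           # k

# Turn an "a/b" string into the number a / b (hypothetical helper).
parse_ratio <- function(x) {
  parts <- as.numeric(strsplit(x, "/", fixed = TRUE)[[1]])
  parts[1] / parts[2]
}

# Strip "/4393" from the BgRatio string, leaving M = 36.
M <- as.numeric(sub("/\\d+", "", bg_ratio))

rich_factor     <- count / M                                        # k / M
fold_enrichment <- parse_ratio(gene_ratio) / parse_ratio(bg_ratio)  # (k/n) / (M/N)

print(c(richFactor = rich_factor, foldEnrichment = fold_enrichment))
```

With these numbers richFactor is about 0.139 and fold enrichment is about 0.451. A fold enrichment below 1 means the gene set makes up a smaller fraction of the DEG list than of the background.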

References:

https://yulab-smu.top/biomedical-knowledge-mining-book/enrichment-overview.html

clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Wu et al 2021. DOI: 10.1016/j.xinn.2021.100141

Boyle, Elizabeth I, Shuai Weng, Jeremy Gollub, Heng Jin, David Botstein, J Michael Cherry, and Gavin Sherlock. 2004. “GO::TermFinder – Open Source Software for Accessing Gene Ontology Information and Finding Significantly Enriched Gene Ontology Terms Associated with a List of Genes.” Bioinformatics (Oxford, England) 20 (18): 3710–15. https://doi.org/10.1093/bioinformatics/bth456.