clusterProfiler: What are GeneRatio and BgRatio?
8.6 years ago
ZheFrench ▴ 590

Question is in the title.

GeneRatio is like M/N, where M is the number of genes from your input list that match the GO term. But I don't see what N is.

BgRatio is like A/B, where B is all genes in the database, but I'm not sure what A corresponds to. Is it the number of genes in the database that are annotated to this GO term?

Tell me if I'm wrong. Thanks.

clusterProfiler • 57k views
7.3 years ago
molla.linda ▴ 290

I will give an example that helped me understand this. I was also looking for the answer, and Guangchuang's link helped.

Let's suppose I have a collection of genesets called HALLMARK, and that it contains a specific geneset called E2F_targets.

BgRatio is M/N.

M = size of the geneset (e.g. the size of E2F_targets); that is, the number of genes within the background distribution that are annotated (either directly or indirectly) to the node of interest.

N = number of unique genes in the whole collection of genesets (e.g. the HALLMARK collection); that is, the total number of genes in the background distribution (universe).

GeneRatio is k/n.

k = size of the overlap between the 'vector of gene id' you input and the specific geneset (e.g. E2F_targets), counting only unique genes; that is, the number of genes within the list n that are annotated to the node.

n = size of the overlap between the 'vector of gene id' you input and all the members of the collection of genesets (e.g. the HALLMARK collection), counting only unique genes; that is, the size of the list of genes of interest.
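
As a rough illustration of the four definitions above, here is a minimal base R sketch. The two collections, the gene IDs, and the helper `ratio_terms` are all made up for this example and are not part of clusterProfiler. It assumes, as described above, that n counts only the input genes that appear somewhere in the collection, which means n and N can differ between collections even for the same input list.

```r
# Hypothetical toy data: these gene IDs and collections are invented for illustration.
collection_a <- list(
  set1 = c("g1", "g2", "g3", "g4"),
  set2 = c("g3", "g4", "g5", "g6")
)
collection_b <- list(
  set3 = c("g1", "g2", "g7"),
  set4 = c("g7", "g8", "g9", "g10", "g11")
)
# g99 is not annotated anywhere; g1 is duplicated on purpose.
input_genes <- c("g1", "g2", "g3", "g7", "g8", "g99", "g1")

# Hypothetical helper: returns k, n, M, N for one geneset in one collection.
ratio_terms <- function(input, collection, set_name) {
  universe <- unique(unlist(collection))          # N: unique genes in the collection
  geneset  <- unique(collection[[set_name]])      # M: genes in the geneset
  input    <- intersect(unique(input), universe)  # n: unique input genes found in the collection
  c(k = length(intersect(input, geneset)),        # k: input genes found in the geneset
    n = length(input),
    M = length(geneset),
    N = length(universe))
}

a <- ratio_terms(input_genes, collection_a, "set1")
b <- ratio_terms(input_genes, collection_b, "set3")
print(a)  # GeneRatio = k/n, BgRatio = M/N for set1
print(b)  # same input list, different collection
```

With the same input list, `a` and `b` report different n and N because each collection covers a different set of genes.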


I'm a little confused about these terms.

When I've used the same gene set, why do my values of n and N change when doing gene ontology enrichment for different categories?

For example, for the same gene list in an overrepresentation test, in Biological Processes for the term taxis the GeneRatio is 209/3770 and the BgRatio is 440/12553, but in Cellular Components for the term extracellular matrix the GeneRatio is 162/3963 and the BgRatio is 339/13183. Shouldn't the n and N values stay the same in different GO categories?

Cheers


Yeah, I have the same problem. I don't really understand why the small n is changing.


I am also struggling with the same problem (i.e. n and N are changing). Have you figured it out?

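library(clusterProfiler)
genes <- letters[1:15]  # 15 genes of interest: "a" to "o"
# Two toy genesets: genesetX = a..j (10 genes), genesetY = b..z (25 genes)
gs_df <- data.frame("gs_name"=c(rep("genesetX", 10), rep("genesetY", 25)),
                    "entrez_gene"=c(letters[1:10], letters[2:26]))
enricher(gene = genes, TERM2GENE = gs_df, minGSSize=1)@result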

               ID Description GeneRatio BgRatio      pvalue    p.adjust       qvalue                      geneID Count
genesetX genesetX    genesetX     10/15   10/26 0.000565352 0.001130704 0.0005951074         a/b/c/d/e/f/g/h/i/j    10
genesetY genesetY    genesetY     14/15   25/26 1.000000000 1.000000000 0.5263157895 b/c/d/e/f/g/h/i/j/k/l/m/n/o    14

GeneRatio = k/n

  • k is the overlap between your genes-of-interest and the geneset
  • n is the number of all unique genes-of-interest

BgRatio=M/N

  • M is the number of genes within each geneset
  • N is the number of all unique genes across all genesets (universe)
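
The p-values in the output above can be reproduced from k, n, M and N alone. The sketch below uses base R's `phyper` and assumes the test is a one-sided hypergeometric (overrepresentation) test, i.e. the probability of seeing at least k annotated genes when drawing n genes from a universe of N containing M annotated genes. The variable names are mine.

```r
# genesetX: GeneRatio = 10/15, BgRatio = 10/26
k <- 10; n <- 15; M <- 10; N <- 26
# P(X >= k): M "successes" and N - M "failures" in the universe, n draws
p_x <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# genesetY: GeneRatio = 14/15, BgRatio = 25/26
p_y <- phyper(14 - 1, 25, 26 - 25, 15, lower.tail = FALSE)

print(p_x)
print(p_y)
```

Under this assumption, `p_x` and `p_y` should agree with the `pvalue` column shown for genesetX and genesetY.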
3.1 years ago
sarahhp ▴ 40

Or perhaps in simpler terms GeneRatio = genes of interest in the gene set / total genes of interest. Most often I use it on lists of differentially expressed genes and so GeneRatio is also the fraction of differentially expressed genes found in the gene set.

I have struggled to find the right words to explain this to others, so I hope this helps!


What about BgRatio?

10 days ago

My first post. I struggled with this and I hope my post can help others. LLM/ChatGPT responses were incorrect. I manually checked the published articles and the code in clusterProfiler to be sure. Please correct me if I'm mistaken.

N is the total number of genes in the background distribution/universe. M is the number of genes within that distribution that are annotated (either directly or indirectly) to the gene set of interest. BgRatio = M/N.

For example, in the hallmark collection for mouse, there are 50 gene sets. The Hallmark_angiogenesis gene set has 36 genes. In the hallmark collection (accessed in 2025), there are 4393 unique genes. So for hallmark_angiogenesis, BgRatio = M/N = 36/4393

The next two terms, n and k, require user input and depend on the specific experimental data, e.g. after identifying DEGs (differentially expressed genes between two experimental conditions).

n is the size of the list of genes of interest, and k is the number of genes within that list which are annotated to each gene set. GeneRatio = k/n. For example, in my data, I have 1352 DEGs, and 5 of these are in the Hallmark_angiogenesis gene set. So k = 5, n = 1352, and GeneRatio = k/n = 5/1352 ≈ 0.0037. (Note: for the Hallmark_angiogenesis gene set, k cannot exceed 36, because there are 36 genes annotated in that gene set, i.e. k is never bigger than M.)

There are two other terms. The richFactor is defined as the ratio of input genes (e.g., DEGs) that are annotated in a term to all genes that are annotated in this term: richFactor = k/M. For my angiogenesis example, richFactor = k/M = 5/36.

The fold enrichment is defined as the ratio of the frequency of input genes annotated in a term to the frequency of all genes annotated to that term, and mathematically, is Fold enrichment = GeneRatio/BgRatio.

(For richFactor, there was a line of code in the Wu et al. paper: y <- mutate(x, richFactor = Count / as.numeric(sub("/\\d+", "", BgRatio))). Please do not confuse this with richFactor = Count/BgRatio. It is not. BgRatio is a string such as "36/4393". The code deletes the "/" and everything after it, i.e. /4393, so 36 is left behind. And so richFactor = 5/36. Again, richFactor = k/M. Also, examine Figure 5A and notice that richFactor varies between 0 and 0.15, which fits in most cases.)
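
To make the two derived quantities concrete, here is a small base R sketch using the angiogenesis numbers from this post (GeneRatio "5/1352", BgRatio "36/4393"). The helper `parse_ratio` is a name I made up for this example and is not a clusterProfiler function.

```r
# Values taken from the example in this post
gene_ratio <- "5/1352"   # k/n
bg_ratio   <- "36/4393"  # M/N
count      <- 5          # k, the Count column

# Hypothetical helper: turn a string like "36/4393" into the number 36/4393
parse_ratio <- function(x) {
  parts <- strsplit(x, "/", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]), numeric(1))
}

# Same idea as the line from the paper: strip "/4393" to recover M
M <- as.numeric(sub("/\\d+", "", bg_ratio))

rich_factor     <- count / M                                       # k/M
fold_enrichment <- parse_ratio(gene_ratio) / parse_ratio(bg_ratio) # (k/n)/(M/N)

print(rich_factor)
print(fold_enrichment)
```

For these numbers the fold enrichment comes out below 1 (roughly 0.45), meaning the gene set is less frequent among the DEGs than in the background.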

References:

https://yulab-smu.top/biomedical-knowledge-mining-book/enrichment-overview.html

clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Wu et al 2021. DOI: 10.1016/j.xinn.2021.100141

Boyle, Elizabeth I, Shuai Weng, Jeremy Gollub, Heng Jin, David Botstein, J Michael Cherry, and Gavin Sherlock. 2004. "GO::TermFinder: Open Source Software for Accessing Gene Ontology Information and Finding Significantly Enriched Gene Ontology Terms Associated with a List of Genes." Bioinformatics (Oxford, England) 20 (18): 3710–15. https://doi.org/10.1093/bioinformatics/bth456.
