Question

Enricher - Hyper Geometric Test Details

0

Entering edit mode

9 months ago

Manuel Sokolov Ravasqueira ▴ 100

Hi!

I have a set of 1435 genes with 445 DE and I am using enricher to analyse a specific group of pathways like this.

em <- enricher(genes, minGSSize=1,
         maxGSSize = 20,  
         universe = names(gene_list), 
         TERM2GENE=hs_kegg_df,
         pAdjustMethod = "bonferroni")

The hs_kegg_df is a set of lipid metabolism pathways. I got the results of em I found strange that none of the pathways had significant p values, so I analysed "KEGG_GLYCEROLIPID_METABOLISM" and it contains 59 genes where 2 are in my gene universe and 2 are DE.

I calculated the p value by hand by doing the hypergeometric test.

So N: Background genes -> 1435, n: DGE genes -> 445, M: Genes annotated in the geneset -> 2, S Geneset -> 59, x - Nº of DGE genes in S -> 2.

phyper(q = 2-1, m = 445, n = 1435 - 445, k = 2, lower.tail = FALSE)

The result was 0.09601563 and the enricher p value without adding the universe [EDIT] is 0.267, still not significant but different from enricher.

I wanted to ask how does the Enricher calculate the p value and if I got the formula wrong in any part?

Best Regards, Manuel

Enricher ClusterProfiler DGE ORA • 1.3k views

ADD COMMENT • link updated 9 months ago by ATpoint 82k • written 9 months ago by Manuel Sokolov Ravasqueira ▴ 100

1

Entering edit mode

Can you post the enricher output for this one, especially the GeneRatio and BgRatio columns? It usually comes down to what people think is the universe compared to what it actually is in terms of intersection with the pathway database. The formula you use appears correct, and the *Ratio columns tell which numbers the test used.

ADD REPLY • link 9 months ago by ATpoint 82k

0

Entering edit mode

Hello ATpoint, thank you for your feedback.

> length(hs_kegg_df$gs_name[hs_kegg_df$gs_name == "KEGG_GLYCEROLIPID_METABOLISM"])
[1] 59

em@result[["ID"]] [1] "KEGG_TRYPTOPHAN_METABOLISM"
[2] "KEGG_ARGININE_AND_PROLINE_METABOLISM"
[3] "KEGG_GLYCEROLIPID_METABOLISM"
[4] "KEGG_STEROID_HORMONE_BIOSYNTHESIS"
[5] "KEGG_ETHER_LIPID_METABOLISM"
[6] "KEGG_FATTY_ACID_METABOLISM"
[7] "KEGG_NICOTINATE_AND_NICOTINAMIDE_METABOLISM"
[8] "KEGG_SPHINGOLIPID_METABOLISM"
[9] "KEGG_ABC_TRANSPORTERS"
[10] "KEGG_ASCORBATE_AND_ALDARATE_METABOLISM"
[11] "KEGG_BETA_ALANINE_METABOLISM"
[12] "KEGG_BUTANOATE_METABOLISM"
[13] "KEGG_GLYCEROPHOSPHOLIPID_METABOLISM"
[14] "KEGG_GLYOXYLATE_AND_DICARBOXYLATE_METABOLISM"
[15] "KEGG_LIMONENE_AND_PINENE_DEGRADATION"
[16] "KEGG_NITROGEN_METABOLISM"
[17] "KEGG_OTHER_GLYCAN_DEGRADATION"
[18] "KEGG_SULFUR_METABOLISM"
[19] "KEGG_VALINE_LEUCINE_AND_ISOLEUCINE_DEGRADATION"
[20] "KEGG_DRUG_METABOLISM_OTHER_ENZYMES"
[21] "KEGG_LYSINE_DEGRADATION"
[22] "KEGG_GLYCOSPHINGOLIPID_BIOSYNTHESIS_LACTO_AND_NEOLACTO_SERIES" [23] "KEGG_HISTIDINE_METABOLISM"
[24] "KEGG_ARACHIDONIC_ACID_METABOLISM"
[25] "KEGG_GLYCOLYSIS_GLUCONEOGENESIS"
[26] "KEGG_O_GLYCAN_BIOSYNTHESIS"
[27] "KEGG_ONE_CARBON_POOL_BY_FOLATE"
[28] "KEGG_PROPANOATE_METABOLISM"
[29] "KEGG_PYRUVATE_METABOLISM"
[30] "KEGG_PURINE_METABOLISM"
[31] "KEGG_GLUTATHIONE_METABOLISM"
[32] "KEGG_PYRIMIDINE_METABOLISM"
[33] "KEGG_INOSITOL_PHOSPHATE_METABOLISM"

em@result[["GeneRatio"]] [1] "3/29" "3/29" "2/29" "2/29" "2/29" "2/29" "2/29" "2/29" "3/29" "1/29" "1/29" "1/29" "1/29" [14] "1/29" "1/29" "1/29" "1/29" "1/29" "1/29" "3/29" "2/29" "1/29" "1/29" "1/29" "1/29" "1/29" [27] "1/29" "1/29" "1/29" "3/29" "1/29" "2/29" "1/29"

em@result[["BgRatio"]] [1] "3/84" "4/84" "2/84" "2/84" "3/84" "3/84" "3/84" "3/84" "6/84" "1/84" "1/84" "1/84" [13] "1/84" "1/84" "1/84" "1/84" "1/84" "1/84" "1/84" "7/84" "5/84" "2/84" "2/84" "3/84" [25] "3/84" "3/84" "3/84" "3/84" "3/84" "10/84" "4/84" "11/84" "7/84"

So the [3] position contains "KEGG_GLYCEROLIPID_METABOLISM" as you can see it has 59 genes included the gene ratio is 2/29 and the Bg Ratio is 2/84.

ADD REPLY • link 9 months ago by Manuel Sokolov Ravasqueira ▴ 100

0

Entering edit mode

After reading to bk11 (thank you very much for the extensive answer). The pvalue expected from enricher is actually 0.1165 and not what I mentioned above.

Being given by phyper(2-1, 29, 84-29, 2, lower.tail = F)

84 is the number of genes from the universe that appear on the pathway set.

29 is the number of DGE genes that appear on pathway set.

The last number 2 is the number of genes from specific pathway that appear on universe.

And 2-1 (x=2) are the DGE genes that appear on specific pathway

With this I ask, let's say I am only analyzing one pathway, the p value always be most of the times 1...... , if the number of background genes is equal to the anotated genes the hypergeometric test will give 1-0=1.

This because the number of genes that appear on pathway set can be all DE this way making phyper(2-1, 29, 0, 2, lower.tail = F) which is equal to 1. But if another random pathway is added so that the it compares with one more pathway, since there is more than one pathway-set, all the DGE genes are not equal to all the genes in the universe therefore, statistically speaking, adding one more pathway would make this specific pathway significant... How can I make this test work for one specific pathway? Thank you.

ADD REPLY • link 9 months ago by Manuel Sokolov Ravasqueira ▴ 100

0

Entering edit mode

With only 1, 2 or 3 that are DE and intersect these pathways I would not do overrepresentation analysis in the first place. It's so few genes, the pvalues are pointless to me. Either look at the genes manually and decide whether they're known to be critical to the process, such as rate limiting enzymes, or do GSEA to make statements about general shifts of a pathway.

ADD REPLY • link 9 months ago by ATpoint 82k

0

Entering edit mode

Thank you @ATpoint . However since I have 1500 genes and not the entire Gene set, GSEA cannot be used right?

ADD REPLY • link 9 months ago by Manuel Sokolov Ravasqueira ▴ 100

0

Entering edit mode

You do DE analysis, this should be done on all genes from an RNA-seq experiment (maybe after prefiltering), and the DE analysis returns a results table with pvalue, logFC etc. Rank by signed pvalue or similar metrics and run GSEA.

ADD REPLY • link 9 months ago by ATpoint 82k

0

Entering edit mode

Yes, I understand ATpoint, i did exactly that, first DGE using limma voom between groups, yet since I don't have the entire gene set of possible genes I think is not recommended to use GSEA.

ADD REPLY • link 9 months ago by Manuel Sokolov Ravasqueira ▴ 100

2

Entering edit mode

I have no idea what you mean. You have DE results you can rank and there are databases such as kegg and reactome that can serve as gene set sources. That is what gsea needs. I think it would help to define what you think a geneset is, because it seems to me there is a misunderstanding.

ADD REPLY • link 9 months ago by ATpoint 82k

score 3 · Answer 1 · 2023-08-21

Here is how p-value is calculate in hypergeometric test-

enter image description here

group1 = 20+5 =25; group2 = 40+5 = 45; Overlap = 5?& total=600

#Test for over-representation (enrichment)
phyper(Overlap-1, group2, Total-group2, group1,lower.tail= FALSE)?
phyper(5-1, 45, 600-45, 25, lower.tail = F)
[1] 0.03232068

If you have run enricher using ClusterProfiler, you will get table like below.

kEGG <- enrichKEGG(gene = deGenes,
                   organism = 'hsa',
                   universe = geneUniverse,
                   pvalueCutoff = 0.05)
tkEGG <- as.data.frame(kEGG)
tkEGG<- subset(tkegg, Count>5)
tkEGG[1:5, 1:6]
ID                                          Description GeneRatio  BgRatio       pvalue     p.adjust
hsa04060 hsa04060               Cytokine-cytokine receptor interaction    22/283 121/5637 9.895248e-08 2.928993e-05
hsa04923 hsa04923                Regulation of lipolysis in adipocytes     9/283  40/5637 1.231339e-04 1.822382e-02
hsa04068 hsa04068                               FoxO signaling pathway    15/283 109/5637 3.163413e-04 2.911032e-02
hsa04360 hsa04360                                        Axon guidance    18/283 150/5637 4.636162e-04 2.911032e-02
hsa04933 hsa04933 AGE-RAGE signaling pathway in diabetic complications    13/283  90/5637 4.917284e-04 2.911032e-02

Now, let's calculate P-value for Cytokine-cytokine receptor interaction manually using phyper function-

Here,
overlap = 22
Group1 = 283
Group2 = 121
Total = 5673
phyper(Overlap-1, group2, Total-group2, group1,lower.tail= FALSE)?
phyper(22-1, 121, 5637-121, 283, lower.tail = F)
[1] 9.895248e-08