Question

Gene set enrichment analysis using a curated gene list and cluster DE genes

4.7 years ago

asmariyaz23 ▴ 10

I have a curated gene list that I would like to use for enrichment analysis of the DE genes in clusters obtained with Seurat. I first tried to do this manually with Fisher's exact test, like so:

No. genes in curated list: 5840 

No. DE genes in Cluster 0 (from Seurat): 512

No. Overlap genes: 209

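No. Universe: 23,000

No. curated genes that are not DE: 5840 - 209 = 5631

No. DE genes that are not curated: 512 - 209 = 303

No. genes that are neither curated nor DE: 23000 - (5631 + 209 + 303) = 16857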

The 2x2 contingency table is laid out as follows:

              DE    not DE
curated       209   5631
not curated   303   16857

The odds ratio looks off in this case, so I am wondering whether I designed the test correctly.
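To make the table concrete, here is a minimal R sketch that rebuilds it from the four counts in the post and computes the cross-product odds ratio and the overlap expected by chance. The variable names are illustrative, and it assumes the 23,000-gene universe quoted in the post is the appropriate background.

```r
# Counts taken from the post; variable names are illustrative
universe <- 23000  # assumed background: all genes considered
curated  <- 5840   # genes in the curated list
de       <- 512    # DE genes in cluster 0
overlap  <- 209    # DE genes that are also in the curated list

# Rows: curated / not curated; columns: DE / not DE
tab <- matrix(
  c(overlap,      curated - overlap,
    de - overlap, universe - curated - de + overlap),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("curated", "not_curated"), c("DE", "not_DE"))
)

# Sample (cross-product) odds ratio: (a * d) / (b * c)
sample_or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Overlap expected if DE status and curated status were independent
expected_overlap <- de * curated / universe

print(tab)
print(sample_or)
print(expected_overlap)
```

With these counts the cross-product odds ratio is about 2.06 and the expected overlap is about 130 genes, so 209 shared genes points to enrichment rather than depletion. Note that fisher.test() reports a conditional maximum-likelihood estimate of the odds ratio, which can differ slightly from the cross-product value.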

Secondly, I was trying to find an R package (like fgsea) that would let me do this kind of analysis. My idea was to feed all DE genes in each cluster as a custom pathway. But I am confused about the ranked list: what should that be? I cannot figure out where the curated gene list fits into the equation. Alternatively, is there a better approach to address this issue?

RNA-Seq enrichment R • 1.9k views

ADD COMMENT • link 4.7 years ago by asmariyaz23 ▴ 10

I will try it this way as well; I just need clarification on two variables, N and k.

N: Is this the total number of genes in the matrix (after initial filtering in a single-cell package, in my case Seurat)?

k: Do you mean only the DE genes in the cluster of interest, or the total number of genes in the cluster?

Thank you again for your insight on this.

ADD REPLY • link 4.7 years ago by asmariyaz23 ▴ 10

The odds ratio looks off in this case

Why do you think this?

ADD REPLY • link 4.7 years ago by Carlo Yague 8.6k

score 1 · Answer 1 · 2019-07-24

4.7 years ago

Jean-Karim Heriche 27k

I think you're going about it the wrong way. If you want to know the probability of having the observed number or more curated genes in a cluster of DE genes, you can cast this as an urn problem. The urn holds N genes, where N is the number of genes tested for differential expression. Of these N genes, m are marked as curated. You draw k genes (the number of genes in the cluster of interest), of which q are curated. The probability of getting q or more curated genes in the cluster just by chance is then given (in R) by phyper(q-1, m, N-m, k, lower.tail=FALSE).
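As a minimal sketch of the urn calculation, the snippet below plugs in the counts from the question. It assumes that N is the 23,000-gene universe quoted there (ideally this would be the number of genes actually tested for differential expression) and that k is the 512 DE genes of cluster 0. The cross-check against an explicit sum of point probabilities is only there to show what the upper tail means.

```r
# Assumed mapping of the question's counts onto the urn parameters
N <- 23000  # genes in the urn (assumed: the stated universe)
m <- 5840   # curated ("marked") genes in the urn
k <- 512    # genes drawn: DE genes in cluster 0 (assumption)
q <- 209    # drawn genes that are curated

# P(X >= q): upper tail, hence q - 1 with lower.tail = FALSE
p_enrich <- phyper(q - 1, m, N - m, k, lower.tail = FALSE)

# Same quantity as an explicit sum of P(X = x) for x = q, ..., min(k, m)
p_sum <- sum(dhyper(q:min(k, m), m, N - m, k))

print(p_enrich)
print(p_sum)
```

If several clusters are tested this way, the resulting p-values would normally be corrected for multiple testing, for example with p.adjust(p_values, method = "BH").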

ADD COMMENT • link 4.7 years ago by Jean-Karim Heriche 27k

The hypergeometric test (urn problem) is equivalent to the corresponding one-tailed version of Fisher's exact test. It is just a different way to think about the data and gives the same p-value. With the OP's data:

> fisher.test(matrix(c(209,5631,303,16857),2,2), alternative="g")$p.value
[1] 8.277633e-15
> phyper(209-1,303+209,16857+5631,209+5631, lower.tail=FALSE)
[1] 8.277633e-15
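The phyper call above draws the 5840 curated genes from an urn with 512 DE genes, whereas the urn described in the answer draws the 512 DE genes from an urn with 5840 curated genes. The hypergeometric distribution is symmetric in these two roles, so both give the same tail probability. A small sketch to check this, assuming the same 23,000-gene universe:

```r
# 2x2 table with the OP's counts (filled column by column)
tab <- matrix(c(209, 5631, 303, 16857), nrow = 2, ncol = 2)

# One-sided Fisher's exact test (enrichment)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Urn with 5840 curated genes, draw the 512 DE genes
p_draw_de <- phyper(209 - 1, 5840, 23000 - 5840, 512, lower.tail = FALSE)

# Urn with 512 DE genes, draw the 5840 curated genes
p_draw_curated <- phyper(209 - 1, 512, 23000 - 512, 5840, lower.tail = FALSE)

# All three should agree up to floating-point error
print(c(fisher = p_fisher, draw_de = p_draw_de, draw_curated = p_draw_curated))
```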

ADD REPLY • link 4.7 years ago by Carlo Yague 8.6k

I know. I was trying to clarify things for the OP, who seemed confused by the GSEA approach.

ADD REPLY • link 4.7 years ago by Jean-Karim Heriche 27k