Question

Gene Set Comparison Without Expression Data

0

Entering edit mode

9 months ago

Eliza ▴ 30

I have been looking all over the web to find some answers to my problem but unfortunately, I was unsuccessful. I wish to determine whether an a priori defined set of genes in my case genes associated with Epidermolysis bullosa shows statistically significant, concordant differences between two biological states (e.g. phenotypes).

I have a list of genes from my own experiment and a predefined list of genes associated with Epidermolysis bullosa, and want to find out if the genes of Epidermolysis bullosa are overrepresented in my list.

Here is an article about GSEA, which is related to what I want to do: https://www.pnas.org/doi/epdf/10.1073/pnas.0506580102 enter image description here

so I have the following data: I'm researching patients with severe or mild disease; in this research, I have found a list of 200 genes after conducting a SKAT test in R (https://www.rdocumentation.org/packages/SKAT/versions/2.2.4/topics/SKAT) with a p-value <0.05 ( which I presume can cause a severe disease instead of mild ) I want to focus on this 200 significant genes that can perhaps affect the disease level, this genes can be -> the L list as described in the article above. Also, the S list of predefined genes ( as is called in the article above )is associated with Epidermolysis bullosa disease. and I want to know if the genes from the Epidermolysis bullosa are overrepresented in my L list. After reading the article and many more resources, I understood that to perform GSEA (https://www.gsea-msigdb.org/gsea/index.jsp), I had to have gene expression data which i don't have .

so I wish to understand if there is a tool that could do a gene set comparison without having the expression data whether it's in R or Python that uses my gene list and predefined list ( in my case associated with Epidermolysis bullosa) ,thank you :)

gene-set-enrichment-analysis R • 1.2k views

ADD COMMENT • link updated 8 months ago by Papyrus ★ 2.9k • written 9 months ago by Eliza ▴ 30

score 2 · Accepted Answer · 2023-07-10

2

Entering edit mode

9 months ago

Papyrus ★ 2.9k

You are looking for tools which perform "ORA" (over-representation analyses), such as, in R: clusterProfiler, goseq, the web server EnricheR... Many of these, such as clusterProfiler, goseq, let you input custom "gene sets" (collections of genes such as your "genes associated with Epidermolysis bullosa"). These tools simply check if genes from some "gene set" appear more often than chance in your set of genes of interest.

Nonetheless, for a single set as you have, you can directly do a Fisher's test to check for enrichment:

N <- 10000 # this should be the number of genes in your "background/universe" (probably all of the genes for which you perform the SKAT test; i.e., all the genes which had a chance of appearing as significant)
K <- 500 # this should be the number predefined list of genes associated with Epidermolysis bullosa, filtered for those appearing in your "background/universe"
n <- 200 # this should be the number of your 200 genes with SKAT p < 0.05
k <- 100 # this should be the size of the intersection betwen your 200 genes and the filtered predefined list of genes associated with Epidermolysis bullosa 

mt <- matrix(c(k,K-k,n-k,N-K-(n-k)), nrow=2)
test <- fisher.test(mt, alternative='greater')
test

ADD COMMENT • link 9 months ago by Papyrus ★ 2.9k

0

Entering edit mode

Papyrus thank you , and if I have more then one predefined set of genes I want to test besides Epidermolysis bullosa, to check if genes from my experiment are enriched with thm the tools you listed would work ?

ADD REPLY • link 9 months ago by Eliza ▴ 30

0

Entering edit mode

Yes, I mean you could always do a loop and use multiple Fisher's/hypergeometric tests, but these tools are specialized for that task and provide a lot of functionalities. Be sure to check the vignettes. For example, the enricher function in clusterProfiler let's you do this if you construct a custom TERM2GENE annotation:

library(clusterProfiler)

TERM2GENE = data.frame(
  term = c(rep("geneset_1",8),rep("geneset_2",5)),
  gene = sample(letters,13)
)

res <- enricher(gene = letters[1:5],
         minGSSize = 0,
         maxGSSize = Inf,
         universe = letters,
         TERM2GENE = TERM2GENE,
         pvalueCutoff = 1,
         qvalueCutoff = 1)

ADD REPLY • link 9 months ago by Papyrus ★ 2.9k

0

Entering edit mode

Papyrus thank you very much for your replies, just one clarification in the function you provided :

res <- enricher(gene = letters[1:5], minGSSize = 0, maxGSSize = Inf, universe = letters, TERM2GENE = TERM2GENE, pvalueCutoff = 1, qvalueCutoff = 1)

here the gene=letters are the genes with pvlues <0.05? and universe in the N ? as here : N <- 10000 # This should be the number of genes in your "background/universe" (probably all of the genes for which you perform the SKAT test; i.e., all the genes which had a chance of appearing as significant) n <- 200 # This should be the number of your 200 genes with SKAT p < 0.05, as I uderstant it also performs hypergeometric test?

ADD REPLY • link 9 months ago by Eliza ▴ 30

1

Entering edit mode

Yes! gene are your significant genes (your genes of interest, n) and universe are all the genes in your analysis (all the genes which had a change of appearing as significant, N). And yes, it will perform hypergeometric tests for all the pathways that are described in TERM2GENE.

ADD REPLY • link 9 months ago by Papyrus ★ 2.9k

0

Entering edit mode

Papyrus thank you very much , I run the analysis and i was wondering i got the following results :

eczema got a segnificant p vlaue but when i run it for example without psoriases the pvablue becomes not senificant >0.05 , cant seem to understand why ?

ADD REPLY • link 8 months ago by Eliza ▴ 30

0

Entering edit mode

What do you mean by running it "without psoriasis"?, I don't understand. Also, if the "BgRatio" column means what I think it means, almost all of the 5426 genes you profiled are classified as belonging to "eczema", so the enrichment is probably not very strong.

ADD REPLY • link 8 months ago by Papyrus ★ 2.9k

0

Entering edit mode

Papyrus I tested if genes from my experiment are enriched with genes from various skin diseases as I understand the eczema p-value is <0.05 does it mean that my genes are enriched with genes associated with eczema ? I also did the regulat hypergeometric test and got a segnificant result : enter image description here

ADD REPLY • link 8 months ago by Eliza ▴ 30

1

Entering edit mode

If would seem that way, if the provided numbers are correct. Are all the 5157 genes associated with eczema present in your background of 8873 genes?

ADD REPLY • link 8 months ago by Papyrus ★ 2.9k

0

Entering edit mode

Papyrus yes this is the intersection intersect(eczema_id$SYMBOL,genes$Gene.refGene_ANNOVAR) which is the K in the test

ADD REPLY • link 8 months ago by Eliza ▴ 30

0

Entering edit mode

Hmm, however I'm not quite sure how, if you have 5157 genes (K) intersecting your background (N), in the enricher results you show, it states that BgRatio (which I think is K/N) is 5320/5426. Check that you're imputing the same data into enricher that you use in the separate Fisher test, and that you're doing everything correctly (no duplicated genes, etc.). The numbers shown by enricher could be lower than in your data because it may filter everything for those genes present in the pathways you submit, but I don't think they should be higher.

ADD REPLY • link 8 months ago by Papyrus ★ 2.9k