Gene Rank for GSEA Preranked analysis?

Hi, I want to run GSEA Preranked.

I have used algorithm X (a probability-based algorithm) to identify causal genes. A highly causal, disease-associated gene receives a probability close to 1 (around 0.8 or 0.9), whereas a weakly causal or non-causal gene receives a probability close to 0. I want to run GSEA on this algorithm's output, using it as the ranking. I have learned that people mostly use log fold change or p-values for GSEA Preranked, but my case is different, so I need your suggestions on how to test this particular algorithm's output.

Question 1: How should I create a rank file for this? I show two examples: table A lists the probabilities directly, in sorted order (the top genes are the causal ones), while table B lists ranking positions (high rank values are the causal genes). Which table (A or B) should I feed into GSEA Preranked, or what would you suggest? If you think both are wrong, please advise me on how the ranking should be built.
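To make the two options concrete, here is a minimal sketch in Python (pandas) of what the two rank files could look like. The gene names and column names are made up for illustration; GSEA Preranked simply expects a two-column, tab-separated gene/score file.

```python
import pandas as pd

# Hypothetical algorithm output: one causality probability per gene
# (gene names and column names are invented for this example).
probs = pd.DataFrame({
    "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "probability": [0.92, 0.85, 0.10, 0.01],
})

# "Table A" style: use the probability itself as the ranking metric.
table_a = probs.sort_values("probability", ascending=False)

# "Table B" style: convert probabilities to rank positions
# (highest probability -> highest rank value).
table_b = table_a.copy()
table_b["rank_value"] = table_b["probability"].rank(method="first")

# GSEA Preranked reads a tab-separated .rnk file: gene <tab> score, no header.
table_a[["gene", "probability"]].to_csv("table_a.rnk", sep="\t", header=False, index=False)
table_b[["gene", "rank_value"]].to_csv("table_b.rnk", sep="\t", header=False, index=False)
```

Note that both metrics give the same ordering; the difference is how the scores are spread out, which matters for the default weighted enrichment statistic (raw probabilities put more weight on the top genes than evenly spaced rank positions do).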

Question 2: I have 17,000 genes in total, but my algorithm returns a result (a probability) for only 3,000 of them; the rest get a value of 0. Should I run the test on those 3,000 genes only, or on all 17,000 genes?
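As a sketch of how the two choices differ in practice (the file names and column names below are assumptions, not anything from your data), genes without an algorithm score can either be dropped or kept at 0 so the full measured universe enters the ranking:

```python
import pandas as pd

# Hypothetical inputs: ~3,000 scored genes and the full ~17,000-gene universe
# actually measured in the experiment (file names are made up).
probs = pd.read_csv("algorithm_scores.tsv", sep="\t", names=["gene", "probability"])
all_genes = pd.read_csv("all_genes.txt", names=["gene"])

# Option 1: rank only the genes the algorithm actually scored.
scored_only = probs.sort_values("probability", ascending=False)

# Option 2: rank the full universe, filling unscored genes with 0 so they
# sit (in arbitrary order) at the bottom of the list.
full_universe = (
    all_genes.merge(probs, on="gene", how="left")
             .fillna({"probability": 0.0})
             .sort_values("probability", ascending=False)
)

scored_only.to_csv("scored_only.rnk", sep="\t", header=False, index=False)
full_universe.to_csv("full_universe.rnk", sep="\t", header=False, index=False)
```

Be aware that with option 2 the roughly 14,000 tied zeros make the ordering of the list's tail arbitrary; whether to include them at all is essentially the background/universe question discussed in the answer below.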

[Image: example rank files — table A (sorted probabilities) and table B (ranking positions)]

gsea score enrichment GSEA python R
e.r.zakiev:

For the second question, there have been debates for at least 10 years on how to handle the universe/background problem. Some people say you should include all genes that could possibly be measured by your platform, or all genes actually reported by your machine in your particular experiment. Others say the background should be restricted at least to the genes specific to your cell type, or to some other very basic, fundamental feature of your experimental material, for example if you sequenced only immune cells or cells with high copy-number variation.

In my case, for example, Seurat returns a set of 2,000 highly variable genes (out of an initial set of 23k genes), and then tests those for differential expression. According to the proponents of background trimming, I should use these 2,000 genes as the background for my enrichment test, but my hunch is that this severs the link between our data and the physical reality of actual genomes, which contain far more than 2,000 genes.
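To see why the choice matters, here is a toy over-representation calculation (not GSEA itself) with made-up numbers, assuming SciPy is available: the same overlap between a gene set and a list of genes of interest is evaluated against a 2,000-gene and a 23,000-gene background.

```python
from scipy.stats import hypergeom

# Toy numbers, purely illustrative: a 100-gene set, 300 genes of interest,
# 30 of which fall inside the set. Only the background size changes.
geneset_size, hits_size, overlap = 100, 300, 30

for universe in (2_000, 23_000):
    # P(X >= overlap) under the hypergeometric null, i.e. the usual
    # over-representation p-value for this background.
    p = hypergeom.sf(overlap - 1, universe, geneset_size, hits_size)
    print(f"background of {universe:>6} genes: p = {p:.3g}")
```

In a real analysis the gene set itself would also be intersected with the chosen background, which shrinks `geneset_size` and changes the result further; the toy keeps it fixed just to isolate the effect of the universe size.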

We should also not forget how the gene sets we test for enrichment were generated in the first place. For example, the study that contributed the FLORIO_NEOCORTEX gene set to MSigDB used the following background when deriving its results:

Expressed genes were defined using a cutoff of FPKM >1. Differentially expressed genes were defined using a cutoff of p<0.01.

So from the get-go, the authors of that study most likely had a different background set of genes from yours, even though you are comparing your results to their gene set. Furthermore, the authors filter their data quite aggressively to obtain the gene set, as illustrated by their own figure ("Stepwise addition of exclusion parameters to the data sets of human genes").

Given the varying conditions under which the reference gene sets (from MSigDB, for example) were generated, I think we should treat them all as if they were derived from the full genome of the organism of interest, unless the authors of a pathway/gene set explicitly state that the dataset used to generate it was limited to a specific number of genes (e.g. 2,000 genes) and list them.

I guess it's best summarized here:

Basically, it is about how wrong is still tolerable to you. This then ties into the second problem: there seems to be little to no penalty for doing an incorrect enrichment analysis. The person sticking to accuracy hamstrings themselves. So the "beware before you publish" is more of wishful thinking of how it should be but isn't.
