---------------------------------------------------------------------

Question

Can I run GSEA on a subset of genes

1


4.1 years ago

huangweiping.official ▴ 10

I have differentially expressed genes (DEGs) from RNA-seq. Can I select a subset of genes to run GSEA on, rather than supplying all expressed genes?

GSEA RNA-Seq • 4.1k views

ADD COMMENT • link updated 4.1 years ago by ATpoint 81k • written 4.1 years ago by huangweiping.official ▴ 10

0


You can, but perhaps should not. You are deliberately omitting results from the experiment, and run the risk of spurious results depending on the size of the subset.

You are better off doing a literature search on the subset of genes. (I mean you chose them for a reason, right?).

ADD REPLY • link 4.1 years ago by Barry Digby ★ 1.3k

0


Yes, I agree with you. The text below is from the official GSEA website. But I find that some people use DEGs (differentially expressed genes), or only up-regulated DEGs, to run GSEA in their papers. I am puzzled: can I do the same?

---------------------------------------------------------------------

How do I filter or pre-process my dataset for GSEA? How you filter or pre-process your data depends on your study. Here are a few guidelines to consider:

Probe identifiers versus gene identifiers. Typically, your dataset contains the probe identifiers native to your microarray platform DNA chip. GSEA can analyze the probe identifiers or collapse each probe set to a gene vector, where the gene is identified by gene symbol. Collapsing the probe sets prevents multiple probes per gene from inflating the enrichment scores and facilitates the biological interpretation of analysis results.

AP call filters. You can run GSEA on filtered or unfiltered data. Typically, the GSEA team runs the analysis on unfiltered data. One suggested approach is to run GSEA on the unfiltered data. If the results seem dominated by gene sets with poorly expressed genes, you might gain insight into what thresholds to use for the call filters.

Expression values. The GSEA algorithm examines the differences in expression values rather than the values themselves. For example, you might have natural scale data or logged expression levels; you might have Affymetrix data or two-color ratio data. As in most data analysis methodologies, the same expression data represented in different formats may generate different analysis results. The differences are expected. GSEA cannot determine which results are "correct."

ADD REPLY • link 4.1 years ago by huangweiping.official ▴ 10

score 4 · Answer 1 · 2020-03-29

No, this is not what GSEA is doing, at least based on my understanding of the method. GSEA asks whether the genes of a given gene set tend, as a group, to be up- or downregulated in your RNA-seq data.

In detail: Say you have a gene set, for example genes that are overexpressed in a certain type of cancer. You also have your RNA-seq data, which you rank by significance (all genes, not a selection as you ask about). For each gene you calculate a ranking metric, e.g. -log10(pvalue) * sign(logFC). sign(logFC) is simply the direction of change, so +1 for genes with a positive fold change and -1 for genes with a negative fold change. The result is a list of all genes ranked by their signed significance.
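As a rough illustration of the ranking step described above, here is a minimal Python sketch. The gene names, p-values and fold changes are made up for the example, and `ranking_metric` is a hypothetical helper, not part of any GSEA package. It assumes p-values are strictly greater than zero, since log10(0) is undefined.

```python
import math

def ranking_metric(pvalue, log_fc):
    """Signed significance: -log10(pvalue) * sign(logFC)."""
    sign = (log_fc > 0) - (log_fc < 0)  # +1, -1, or 0 for no change
    return -math.log10(pvalue) * sign

# Hypothetical DGE results for ALL tested genes: gene -> (p-value, log2 fold change)
dge = {
    "GENE_A": (1e-8, 2.1),
    "GENE_B": (0.5, -0.1),
    "GENE_C": (1e-4, -1.7),
    "GENE_D": (0.03, 0.4),
}

scores = {gene: ranking_metric(p, lfc) for gene, (p, lfc) in dge.items()}

# Most significantly upregulated genes first, most significantly downregulated last
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Note that non-significant genes such as GENE_B stay in the list; they end up near the middle of the ranking rather than being removed.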

Now you feed this into GSEA. GSEA checks (typically using a permutation-based test) whether the genes of the gene set accumulate significantly at either end of the ranked list, that is, whether they tend to be globally more upregulated or downregulated in your data.

In the plot below, the x-axis shows the ranked genes, e.g. those with positive scores towards the left and those with negative scores towards the right. The ranking is, as said, based on your data. Each black bar represents one gene of the gene set, and the position of each bar is the rank of that gene in your ranked list. The enrichment score is a running statistic that rises each time a gene-set member is encountered while walking down the ranked list and falls otherwise, so it reflects how strongly the gene-set genes accumulate at a given position on the x-axis. The curve here peaks at the far left, which indicates (given that upregulated genes were ranked to the left of the x-axis) that this gene set is rather overexpressed in this dataset. GSEA also outputs a p-value, which helps decide whether the enrichment is significant.
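To make the enrichment score more concrete, here is a simplified Python sketch of a weighted running-sum score with a gene-set permutation p-value. It is an illustration under assumptions, not the exact procedure of the GSEA software or fgsea: the toy scores, the function names, the two-sided p-value and the choice of permuting gene-set membership are all mine. It also assumes the gene set has at least one hit with a non-zero score and does not cover the whole list.

```python
import random

def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Walk down the ranked list: step up at gene-set hits, step down at misses.
    Returns the running-sum value with the largest deviation from zero."""
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_total = sum(abs(scores[g]) ** weight for g, h in zip(ranked_genes, hits) if h)
    running, best = 0.0, 0.0
    for gene, is_hit in zip(ranked_genes, hits):
        if is_hit:
            running += abs(scores[gene]) ** weight / hit_total
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def permutation_pvalue(ranked_genes, scores, gene_set, n_perm=1000, seed=0):
    """Compare the observed score with scores of random gene sets of equal size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set)
    null = [
        enrichment_score(ranked_genes, scores,
                         set(rng.sample(ranked_genes, len(gene_set))))
        for _ in range(n_perm)
    ]
    extreme = sum(abs(es) >= abs(observed) for es in null)
    return observed, (extreme + 1) / (n_perm + 1)

# Toy ranked list of 20 genes: g0 is most upregulated, g19 most downregulated
scores = {f"g{i}": 10.5 - i for i in range(20)}
ranked = sorted(scores, key=scores.get, reverse=True)

es_up, p_up = permutation_pvalue(ranked, scores, {"g0", "g1", "g3"})
es_down = enrichment_score(ranked, scores, {"g18", "g19"})
print(es_up, p_up, es_down)
```

A gene set clustered at the top of the ranking gives a positive score close to 1, one clustered at the bottom gives a negative score. This only works because the full ranked list provides the "misses" against which the gene set is compared, which is why feeding in only a preselected subset of genes undermines the test.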

[Figure: GSEA enrichment plot]

The idea behind this is the following. If you perform DGE for each gene then many genes might not be significantly different. Still, if many genes from the same pathway (which one could use as a gene set) tend to be modestly but not significantly upregulated then their cumulative effect might still cause a biologically-meaningful effect. It simply depends on the question you ask. Pairwise DGE analysis informs about individual genes while GSEA informs about global trends or the cumulative tendency of gene expression changes.

If you have a set of genes that are significant in your DGE analysis, e.g. the significantly upregulated genes, and you want to check whether they are enriched for biological functions, then you can use tools such as gprofiler2. In R you could use the function gprofiler2::gost(). This checks whether these genes are significantly enriched in certain pathways. By default, gprofiler2 checks against GO terms, KEGG pathways, REACTOME pathways etc. This might be easier to interpret in some situations than a GSEA.

If you have a strong phenotype, i.e. many genes are changing, then this might be the method of choice. If you have very modest changes and/or assume that the cumulative effect of the genes is biologically meaningful rather than the per-gene effect, then GSEA might be better. It all depends on the scientific question and context. In R you could use the fgsea package for GSEA, which I personally find the most convenient to use.
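For intuition on what such an enrichment check of a selected gene list does, here is a minimal Python sketch of a one-sided hypergeometric test for a single pathway. This is a conceptual illustration only and not the gprofiler2 implementation: the function name and all counts are hypothetical, and no multiple-testing correction is applied, which a real tool would do across all pathways tested.

```python
from math import comb

def hypergeom_pvalue(n_background, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_background genes, of which n_pathway belong to the pathway."""
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 10,000 background genes, a pathway with 100 genes,
# 200 significant DEGs, 10 of which fall into the pathway (2 expected by chance)
p = hypergeom_pvalue(n_background=10_000, n_pathway=100, n_selected=200, n_overlap=10)
print(p)
```

Unlike GSEA, this kind of test takes a preselected gene list as input, so it is the appropriate place to use "only the DEGs". The choice of background (assumed here to be all genes tested in the experiment) affects the result.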