What kinds of GSEA would be most appropriate for RNA-Seq Data?
12 months ago
Noah E. ▴ 20

This is a continuation from this post I made but I figured I would ask in a different and more explicit manner given that I have a deeper understanding of my dataset:

As far as my experimental design:

I received counts data for different treatment conditions after transfection with shRNAs. The counts data from each shRNA should serve as a proxy for whether or not the gene associated with said shRNA was associated with preferential survival or death in the given treatment conditions. Lower counts with the shRNA would indicate that the gene was important in the survival of the cell line in the experimental conditions.

Thus far, for each shRNA, I calculated its log2Fold change and adjusted p-value using both DESeq2 and edgeR. I then thresholded the results to keep only those shRNAs that fell below a certain adjusted p-value.

From this list, I then selected, for each gene, the shRNA with the largest absolute log2Fold change (that is, the strongest effect, whether positive or negative), since there was more than one shRNA per gene. I needed a single shRNA for each gene because the GSEA methods I planned to pursue only allow one instance of a gene in a provided list. I chose the shRNA with the largest absolute log2Fold change after p-value thresholding on the advice of my labmates.
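The selection step described above can be sketched in a few lines of Python. This is only an illustration: the shRNA IDs, gene names, values, the `pick_representatives` helper, and the 0.05 cutoff are all hypothetical and not taken from the actual dataset.

```python
# Hypothetical per-shRNA results: (shRNA id, gene, log2 fold change, adjusted p-value).
results = [
    ("sh1", "GENE_A", -2.1, 0.001),
    ("sh2", "GENE_A", 0.4, 0.300),
    ("sh3", "GENE_A", 1.5, 0.010),
    ("sh4", "GENE_B", 0.9, 0.020),
    ("sh5", "GENE_B", -3.0, 0.200),
    ("sh6", "GENE_C", 0.2, 0.800),
]


def pick_representatives(rows, padj_cutoff=0.05):
    """Keep, per gene, the significant shRNA with the largest |log2FC|."""
    best = {}
    for shrna, gene, lfc, padj in rows:
        if padj >= padj_cutoff:
            continue  # drop shRNAs that fail the adjusted p-value threshold
        if gene not in best or abs(lfc) > abs(best[gene][1]):
            best[gene] = (shrna, lfc)
    return best


representatives = pick_representatives(results)
for gene, (shrna, lfc) in sorted(representatives.items()):
    print(gene, shrna, lfc)
```

Note that a gene whose shRNAs all fail the threshold (GENE_C here) disappears from the output entirely, which matters for methods that expect the full gene list.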

I have other information about the shRNA data in the following dataframe as well, although this is likely not relevant to GSEA methods (all information is from edgeR; percent_signif_shrna indicates the percentage of shRNAs targeting a particular gene that fell below a specified p-value and above a certain log2Fold change).

Gene Information

I now want to use tools to understand the values associated with my output. Thus far, I used PantherDB's Statistical Enrichment Test (using the Panther Pathways section) to gather some preliminary information. I used this Statistical Enrichment Test since I wanted to leverage both the gene names and the log2Fold change (since I have the quantitative data). The .txt file I put into PantherDB looked like the following tab-delimited file:

Tab-delimited file with log2Fold changes
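For readers who cannot see the screenshot, a file of this general shape could be produced as follows. This is a sketch under assumptions: the two-column layout without a header row, the gene names, the values, and the `to_tab_delimited` helper are illustrative and may not match the original file exactly.

```python
import csv
import io

# Hypothetical gene-level values; names and numbers are illustrative only.
gene_lfc = {"GENE_A": -2.1, "GENE_B": 0.9}


def to_tab_delimited(values):
    """Return one 'gene<TAB>log2FC' line per gene, sorted by gene name."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for gene in sorted(values):
        writer.writerow([gene, values[gene]])
    return buffer.getvalue()


text = to_tab_delimited(gene_lfc)
print(text, end="")
```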

I now want to use other types of GSEA to analyze my output and figure out the biological meanings of these values. I have identified the following potential tools: DAVID, GAGE, MSigDB, Enrichr, and RTopper.


  1. Are there any tools that anyone believes I should consider given my experimental design?

  2. Should I know anything about any of these tools in particular before I provide my values (e.g. is one considered 'better' or 'worse' for the RNA-Seq data that I intend to work with?)

  3. It seems like many of the tools only take in lists of genes and do not work with the log2Fold changes (unlike PantherDB's 'Statistical Enrichment Test'). If this is the case, should I ONLY include the genes that have positive log2Fold values (i.e., only those that are enriched, and not those that are downregulated)? It seems that 'enrichment' is more the target of these tests than downregulation, so that seems to be the most appropriate recourse.

  4. Do any of these tools leverage log2Fold in any capacity? I feel that having that additional quantitative metric (in addition to thinking about the downregulation of certain pathways) may help with my analyses. I am, however, very new to this area and understand that in some cases the list of genes alone may be the only thing worth reporting.

rnaseq R • 617 views
12 months ago

The Gene Set Enrichment Analysis (GSEA) method developed by the Broad Institute would be my choice for pathway enrichment analysis. However, there are some points to consider when using this tool. The most important one is that you should avoid filtering your dataset based on statistical significance metrics such as p-value or log fold change.

Since you asked about the possibility of using log2Fold in enrichment analysis: yes, it's possible. A workaround for using that kind of data with GSEA is to define a rank variable that ranks all the genes in the list based on log2Fold and p-value. You can find more about how to do this by reading this post, which shows how to perform GSEA in R. You should also be able to find more details there on the different types of enrichment analysis.
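One common way to build such a rank variable is a signed significance score: the direction comes from the log2 fold change and the magnitude from -log10 of the p-value. The sketch below assumes that particular metric, which may differ from what the linked post uses; the gene names, values, and the `rank_metric` helper are hypothetical.

```python
import math

# Hypothetical per-gene statistics: gene -> (log2 fold change, p-value).
stats = {
    "GENE_A": (-2.0, 0.001),
    "GENE_B": (1.0, 0.01),
    "GENE_C": (0.1, 0.5),
    "GENE_D": (3.0, 0.0001),
}


def rank_metric(lfc, pvalue, floor=1e-300):
    """Signed significance: direction from log2FC, magnitude from -log10(p)."""
    # The floor guards against log10(0) when a p-value underflows to zero.
    return math.copysign(-math.log10(max(pvalue, floor)), lfc)


# Rank every gene (no filtering), strongest positive effect first.
ranked = sorted(
    ((gene, rank_metric(lfc, p)) for gene, (lfc, p) in stats.items()),
    key=lambda item: item[1],
    reverse=True,
)

for gene, score in ranked:
    print(gene, round(score, 3))
```

The full ranked list, including non-significant genes near zero, is what a pre-ranked GSEA run would take as input.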


Thank you for the comment! The post that you linked is going to be helpful as I navigate the rest of my analyses.

For my edification: why should we avoid 'filtering out' genes a priori with a p-value? I would imagine anything with a high p-value might not be helpful for the analyses anyway and instead bog down the results.


Filtering out genes with low expression or low variance from the input may reduce the statistical power of the enrichment analysis.

Here is what the GSEA user guide says about pre-processing:

Filtering based on expression values. For many other analytical algorithms, such as clustering, it makes sense to pre-process a dataset. For example, before running hierarchical clustering, you might remove genes that have low variance across the dataset. This prevents flat genes from driving the clustering result and improves processing time by focusing on a smaller number of interesting genes. The GSEA algorithm does not filter the expression dataset and generally does not benefit from your filtering of the expression dataset. During the analysis, genes that are poorly expressed or that have low variance across the dataset populate the middle of the ranked gene list and the use of a weighted statistic ensures that they do not contribute to a positive enrichment score. By removing such genes from your dataset, you may actually reduce the power of the statistic and processing time is rarely a factor as GSEA can easily analyze 22,000 genes with even modest processing power. However, an exception exists for RNA-seq datasets where GSEA may benefit from the removal of extremely low count genes (i.e., genes with artifactual levels of expression such that they are likely not actually expressed in any of the samples in the dataset).
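To see why genes in the middle of the ranked list contribute so little, here is a simplified sketch of a weighted running-sum enrichment score. It is an illustration only, not the Broad implementation: it omits permutation testing and normalization, and the gene names, scores, and the `enrichment_score` helper are hypothetical.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Weighted running-sum score; ranked is (gene, score) sorted descending."""
    hits = [abs(score) ** weight if gene in gene_set else 0.0
            for gene, score in ranked]
    total_hit = sum(hits)
    n_miss = sum(1 for gene, _ in ranked if gene not in gene_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("need set members with nonzero scores and some non-members")
    running, best = 0.0, 0.0
    for (gene, _), hit in zip(ranked, hits):
        if gene in gene_set:
            running += hit / total_hit  # step up, scaled by |score|
        else:
            running -= 1.0 / n_miss     # uniform step down for non-members
        if abs(running) > abs(best):
            best = running              # keep the largest deviation from zero
    return best


# Hypothetical ranked list: strong effects at the ends, flat genes in the middle.
ranked = [("G1", 3.0), ("G2", 2.0), ("G3", 0.1),
          ("G4", -0.1), ("G5", -2.0), ("G6", -3.0)]

es_mixed = enrichment_score(ranked, {"G1", "G3", "G4"})
es_top = enrichment_score(ranked, {"G1", "G2"})
print(es_mixed, es_top)
```

In the mixed set, G1 alone accounts for 3.0 / 3.2 of the upward steps, while the two near-zero genes G3 and G4 each add only 0.1 / 3.2, so leaving them in the list costs nothing.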

