Understanding Gene Set Enrichment Analysis (GSEA) Tools in RNA-Seq
3 days ago
Noah E. ▴ 20

Hello all,

I am very new to the bioinformatics space and am still trying to figure out how to make sense of my data. Thus far, I have used DESeq2 to create results and matrices with respect to RNA-Seq data.

I have information about the genes in my dataset, their log2Fold change, L2F standard error, pvalue, p-adjusted value, and type of selection (Positive and Negative). I have made visualizations via heat maps (both normalized for count #, relative distances), tables, and volcano plots based on the above information. The data is from human cell lines.

I now want to get a better sense of the pathways that are enriched given the differential gene expression. I figured that I would start with PantherDB since it seems to be the easiest means of loading the data.

With that being said, for PantherDB:

1. Should I provide all of the genes with differential expression? Or just those with either positive or negative selection (and evaluate the two separately)? I gather the former but figured I would ask.

2. Would the list of genes alone be sufficient? Or should I provide any other quantitative/numeric information (e.g. Log2Fold change) to enable the program to better weight the genes/pathways? Based on the paper, it seems like the Statistical Enrichment test requires a "numerical value" but I do not know if log2Fold change alone (without p-value) would be sufficient.

3. Given that I want to see specific pathways, what would be the best analysis to conduct? I would imagine one of the statistical enrichment tests would be best but I wanted to check. I imagine each test would provide valuable insights - is there anything I should do to ensure I can best understand what I am looking at if I look at multiple tests?

4. This is a novice question but, since I have a large gene set, I seem to need an organism from the drop-down menu. Is there nothing for homo sapiens?

I also would like to better acquaint myself with some other means to look at GSEA. Does anyone have recommendations for potential tools other than MSigDB?

Thank you for your patience,
NE

deseq2 pantherdb msigdb • 296 views
1. Yes, you should provide all the differentially expressed genes. We generally filter the genes before the pathway analysis (for example, fold change ±1.5 and adjusted p-value < 0.05). You can do it either way: make separate lists of up- and downregulated genes, or one list with log2FC values.

2. If you are performing the statistical enrichment test, the first column should be the gene ID and the second the log2FC. You can ignore the p-value here since you will provide only genes that pass your cutoff.

3. Statistical enrichment tests should be performed if you want to see the pathways that are modulated. Explore all the annotation sets and try to see what explains your data. (Is there any particular pathway that you were expecting to be down- or upregulated?)

4. What do you mean by "Is there nothing for homo sapiens?"

5. If you are familiar with R, you can explore GAGE (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-161), or you can try Enrichr (https://maayanlab.cloud/Enrichr/).
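
To make points 1 and 2 concrete, here is a minimal sketch of the filtering and export step, assuming the DESeq2 results were saved as a CSV. The column names `log2FoldChange` and `padj` are the ones DESeq2 uses; the `gene` column name, the demo rows, and the helper names `select_genes` and `to_panther_text` are made up for illustration. It also assumes the ±1.5 cutoff is a linear fold change, so it is compared on the log2 scale as log2(1.5); adjust that if you meant a log2 value.

```python
import csv
import io
import math


def select_genes(rows, lfc_cutoff=math.log2(1.5), padj_cutoff=0.05):
    """Return (gene, log2FC) pairs that pass both cutoffs."""
    kept = []
    for row in rows:
        try:
            lfc = float(row["log2FoldChange"])
            padj = float(row["padj"])
        except ValueError:
            # DESeq2 writes NA when a value could not be computed; skip those genes.
            continue
        if math.isnan(lfc) or math.isnan(padj):
            continue
        if abs(lfc) >= lfc_cutoff and padj < padj_cutoff:
            kept.append((row["gene"], lfc))
    return kept


def to_panther_text(pairs):
    """Format pairs as two tab-separated columns with no header row."""
    return "".join(f"{gene}\t{lfc}\n" for gene, lfc in pairs)


# Hypothetical demo data standing in for an exported DESeq2 results table.
demo_csv = """gene,log2FoldChange,padj
GENE_A,2.0,0.001
GENE_B,-1.2,0.03
GENE_C,0.3,0.0001
GENE_D,1.8,NA
GENE_E,-0.9,0.2
"""

rows = list(csv.DictReader(io.StringIO(demo_csv)))
selected = select_genes(rows)

# Option A: one list with log2FC values (for the statistical enrichment test).
panther_text = to_panther_text(selected)

# Option B: separate up/down lists (gene IDs only).
up = [pair for pair in selected if pair[1] > 0]
down = [pair for pair in selected if pair[1] < 0]

print(panther_text, end="")
```

Write `panther_text` to a `.txt` file to upload it. The columns are joined with a real tab character, not padded with spaces.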

Hello,

Thank you for the reply! Disregard #4. I think I was confusing myself.

However, none of my sample files for the statistical enrichment test are processed properly. I even tried Supplementary Data #2 from the paper and got the following error: "Sequence ID and numerical value columns are required." Is there something I should know about how to upload the file with the gene names and the values?

I will also start to look into the Gage package.


It should be in .txt format


I am uploading in a .txt format. I actually was able to get the sample file to work once I downloaded it directly from my browser rather than copying and pasting the values into a text file.

However, I still cannot get my own file to run properly. It looks almost identical to the testing_file.txt file (the sample I linked above that works), but I still receive the following error: "Sequence ID and numerical value columns are required.,There was an error parsing the file."

Does anyone have any ideas on how to make this work? I attached a brief image with data from the file that works (testing_file.txt) and my R output.


Addendum: I believe I was able to get my file to work for the most part. It seems like I needed to make the following changes:

1. Ensure that all of the values in the second column are aligned with one another (regardless of whether the first character is a '-' or a digit, in the cases of a negative or positive log2Fold change, respectively).

2. Ensure that the distance between the end of the first column and the start of the second column (with the value) is not too large.


Yup, the second column in the second text file is not separated from the first by a TAB.
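
If it helps, here is a rough sketch for checking a file before uploading it. It only tests what came up in this thread: each line should have exactly two tab-separated columns, and the second should be numeric. PANTHER's parser may have other rules that this does not cover. The function names and demo strings are made up for illustration.

```python
def find_format_problems(text):
    """Return (line_number, message) for every line that is not ID<TAB>number."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(
                (number, f"expected 2 tab-separated columns, found {len(fields)}")
            )
            continue
        try:
            float(fields[1])
        except ValueError:
            problems.append((number, "second column is not numeric"))
    return problems


def spaces_to_tabs(text):
    """Rewrite lines whose two columns are padded with spaces as ID<TAB>value."""
    fixed = []
    for line in text.splitlines():
        parts = line.split()  # splits on any run of whitespace
        fixed.append("\t".join(parts) if len(parts) == 2 else line)
    return "\n".join(fixed) + "\n"


# Hypothetical examples: a tab-separated file and a space-aligned one.
good_text = "GENE_A\t2.0\nGENE_B\t-1.2\n"
bad_text = "GENE_A    2.0\nGENE_B   -1.2\n"

for number, message in find_format_problems(bad_text):
    print(f"line {number}: {message}")

repaired = spaces_to_tabs(bad_text)
```

Space-aligned columns can look identical to tab-separated ones in a text editor, which would explain why a file that "looks" like the working sample still fails to parse.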