Hi,
I need a clear explanation of how GSEA relates to other functional analyses in general. I understand that it tests whether a priori defined gene sets differ significantly between two phenotypes, but how is this different from a GO over-representation test or a KEGG pathway analysis? I remember reading that the latter two need a pre-specified significance threshold, whereas some analyses require no prior statistical threshold and look only at the relative difference between phenotype groups.
I have done a DE gene analysis with the DESeq2 package to get a list of significant genes between two phenotype groups (in my case, seizure history yes/no, from 475 samples in total and about 30k genes). From the significant gene list, I ran a GO term over-representation test (using g:Profiler and clusterProfiler) and also used the GAGE package to find KEGG pathways with significant adjusted p-values.
On the GSEA website I read that the Molecular Signatures Database (MSigDB) is divided into 8 major collections and several sub-collections. Is there a hierarchy among these collections, or do they overlap? To be more specific, will the significant KEGG pathways I get from MSigDB differ from those I would get from a different source?
If I end up using the GSEA desktop or Java software, should I use normalized count data or raw count data? Has anyone run GSEA on RNA-seq data with dimensions like mine (475 samples and 30k genes)? If so, any advice and a brief introduction to the workflow would be highly appreciated!
Read Tarca et al. (2013); the 2nd and 3rd paragraphs of the introduction in particular should help clarify some of your doubts.
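To make the threshold question concrete, here is a minimal Python sketch contrasting the two ideas on toy data. It is not the implementation used by GSEA, GAGE, g:Profiler or clusterProfiler: the gene names, scores, cutoff and the helper names `ora_pvalue` and `enrichment_score` are all made up for illustration. The over-representation test needs a cutoff to decide which genes count as significant, while the GSEA-style running sum uses the whole ranked list and no cutoff. Real GSEA additionally assesses significance by permutation, which is left out here.

```python
from math import comb


def ora_pvalue(n_universe, n_in_set, n_significant, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    This is the usual over-representation test: it only sees which genes
    passed the significance cutoff, not how strongly they changed.
    """
    upper = min(n_in_set, n_significant)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_significant - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_universe, n_significant)


def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """GSEA-style running-sum statistic over a full ranked gene list.

    Walk down the list: step up at genes in the set (weighted by
    |score| ** weight), step down otherwise, and return the largest
    deviation from zero. No significance cutoff is involved.
    """
    hits = [g in gene_set for g in ranked_genes]
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(scores, hits)]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(hits)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for w, h in zip(hit_weights, hits):
        running += w / total_hit if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical toy data: 10 genes already sorted by a DE statistic.
genes = [f"g{i}" for i in range(1, 11)]
scores = [5.0, 4.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0, -4.0, -5.0]
gene_set = {"g1", "g2", "g3"}

# Over-representation: requires an (arbitrary, assumed) cutoff on the statistic.
cutoff = 4.0
significant = {g for g, s in zip(genes, scores) if abs(s) >= cutoff}
overlap = len(significant & gene_set)
p_ora = ora_pvalue(len(genes), len(gene_set), len(significant), overlap)

# GSEA-style: uses every gene's rank and score, no cutoff.
es = enrichment_score(genes, scores, gene_set)

print(f"ORA overlap={overlap}, p={p_ora:.4f}")
print(f"Enrichment score={es:.4f}")
```

Note that gene g3 sits just below the cutoff, so the over-representation test ignores it, while the running sum still credits it. That is the practical difference between the two families of methods.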