Hi,
I am doing a functional analysis of a differential expression analysis on a non-model organism. I have one experimental group and one control group with four replicates each.
The number of differentially expressed genes I get is around 2000 (FC > |1| and adjusted p-value < 0.05) and they have about 3500 GO terms associated with them.
When I do an enricher GO over-representation analysis, I use the following code:
enricher_10 <- enricher(diff_exp_genes_names, 
                    pvalueCutoff = 0.05,
                    pAdjustMethod = "BH",
                    universe = NULL,
                    minGSSize = 10,
                    qvalueCutoff = 0.2,
                    gson = NULL,
                    TERM2GENE=go_data,
                    TERM2NAME=NA)
The diff_exp_genes_names is a vector of custom gene IDs:
 chr [1:2021] "Pr_t00014577-TA" "Pr_t00019548-TA" "Pr_t00018659-TA" ..
and my TERM2GENE file has this general format:
                GOs      query_name
        <char>          <char>
 1: GO:0005275  Pr_t00014577-TA
 2: GO:0005222  Pr_t00018659-TA
 3: GO:0005323  Pr_t00018659-TA
Here I get no statistically significant GO terms after adjustment for multiple testing, but many are statistically significant before correction, so the actual analysis is carried out.
When I do GSEA with this code:
GSEA_results_fgsea_minGSSize_10 <- GSEA(gene_list_final, 
                                    exponent = 1, 
                                    minGSSize = 10,
                                    maxGSSize = 500, 
                                    eps = 1e-10,
                                    pvalueCutoff = 0.05, 
                                    pAdjustMethod = "BH",
                                    gson = NULL, 
                                    TERM2GENE=go_data, 
                                    TERM2NAME = NA, 
                                    verbose = TRUE,
                                    seed = FALSE, 
                                    by = "fgsea")
I get plenty of statistically significant results.
The gene_list_final looks like this:
Named num [1:12510] 7.35 6.99 6.7 6.66 6.6 ...
- attr(*, "names")= chr [1:12510] "Pr_t00018699-TA" "Pr_t00007428-TA" "Pr_t00004980-TA" "Pr_t00004978-TA" ...
Question 1: Have I done the GO over-representation analysis incorrect?
Question 2: Does the GO over-representation analysis not produce any statistically significant results because the number of GO terms are too many?
Question 3: Should I use a more stringent cut-off for FC and adjusted p value to get a list of differentially expressed genes that are easier to analyze and handle? How do you analyze such large lists of genes (circa 2000 genes)?