I wondered if anyone could offer some advice on gprofiler2, specifically on the choice of background when I use it for a gene set enrichment analysis (by background I mean the gene list that the query is compared against, I believe). Two of the options in this package are "annotated", which uses "genes with at least one annotation", and "known", which uses all the known genes in the organism (see https://rdrr.io/cran/gprofiler2/f/vignettes/gprofiler2.Rmd).
When I compare my experimental gene list against "known" as background I get many more results with gprofiler2 (which has functions for gene ontology, TRANSFAC analysis etc.), but when I use "annotated" I get very few or no results. My inclination is to use "known", but I'm not sure whether this is reliable or whether there are other considerations I should take into account before just going with it.
Does anyone have any advice on or experience of this?
I really appreciate any advice or help people can offer - thanks!
When I use gprofiler2 for term enrichment analysis (so testing differential genes against terms from KEGG/REACTOME) I use all genes that were analysed in the DE analysis as background. As I usually use edgeR or DESeq2, this would be (for edgeR) the genes that survive the filterByExpr filter and (for DESeq2) the genes that are not NA after running results, so those surviving the independent/outlier filtering.
That said, the option you describe defines which genes the tool considers as background. With "annotated", only genes that have some annotation are considered. Imagine your gene is a poorly annotated non-coding RNA without any known function. "Annotated" would probably ignore that gene while "known" would include it. That changes the total number of genes in the background and therefore the p-value calculations. To be honest, I never changed the defaults, so I use "annotated", meaning that only those genes from my custom background that have some kind of annotation are actually considered. That is probably reasonable, as unannotated genes carry no information for this kind of analysis and therefore should probably be ignored.
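To make the effect of background size on the p-value concrete, here is a minimal sketch of a one-sided hypergeometric over-representation test, the kind of test commonly used for this type of enrichment. It is not gprofiler2's actual code, and all numbers (query size, term size, overlap, and the two background sizes) are made up for illustration. For simplicity it also assumes the query size stays the same under both backgrounds, which is not strictly true in practice because unannotated query genes would be dropped too.

```python
from math import comb


def hypergeom_enrichment_p(overlap, query_size, term_size, background_size):
    """One-sided p-value P(X >= overlap) for drawing `query_size` genes
    from a background of `background_size` genes, of which `term_size`
    belong to the term."""
    upper = min(query_size, term_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(term_size, i) * comb(background_size - term_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers, chosen only for illustration.
query_size = 200   # genes in the query list
term_size = 150    # genes annotated to the term
overlap = 8        # query genes that hit the term

# Larger background, e.g. all "known" genes (assumed size).
p_known = hypergeom_enrichment_p(overlap, query_size, term_size, 20000)
# Smaller background, e.g. only "annotated" genes (assumed size).
p_annotated = hypergeom_enrichment_p(overlap, query_size, term_size, 12000)

print(f"p with larger background:  {p_known:.3g}")
print(f"p with smaller background: {p_annotated:.3g}")
```

With the same overlap, the larger background makes the term look rarer, so the expected overlap by chance drops and the p-value gets smaller. That is consistent with seeing many more significant terms with "known" than with "annotated".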
Note that what gprofiler2 (the gost function) does is not GSEA (Gene Set Enrichment Analysis). It tests a list of genes (e.g. differential genes) for enrichment of functional terms. GSEA, in contrast, checks whether your entire transcriptome shows a tendency to be up- or downregulated as a whole for functional terms. For this you rank all your genes by a metric, e.g. fold change or p-value, and then look at the distribution of ranks of the genes that overlap the terms you check against. The question in GSEA is whether a gene set as a whole shows evidence of over- or underexpression; it does not ask whether a user-defined set of genes (e.g. differential ones) is enriched for certain terms. That means GSEA can be significant even if not a single gene in your analysis is differential in a pairwise analysis.
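The rank-based idea can be sketched with a simplified, unweighted running-sum score (Kolmogorov-Smirnov style). This is only an illustration under stated assumptions: real GSEA implementations typically weight hits by the ranking metric and assess significance by permutation, neither of which is done here, and the gene names and sets are hypothetical.

```python
def running_sum_score(ranked_genes, gene_set):
    """Walk down the ranked list, stepping up for genes in the set and
    down for genes outside it; return the largest deviation from zero."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap part of the ranked list, not none or all of it")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranking of 20 genes, e.g. sorted by fold change (g1 = most upregulated).
ranked_genes = [f"g{i}" for i in range(1, 21)]

# A set concentrated at the top of the ranking versus one spread evenly.
set_at_top = {"g1", "g2", "g3", "g4"}
set_scattered = {"g3", "g8", "g13", "g18"}

es_top = running_sum_score(ranked_genes, set_at_top)
es_scattered = running_sum_score(ranked_genes, set_scattered)

print("score, set at top of ranking:", es_top)
print("score, scattered set:", es_scattered)
```

Note that no significance cutoff on individual genes enters this calculation at all: only the positions of the set's genes in the full ranking matter, which is why such a score can be large even when no single gene is called differential.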