Hi, I want to run a gene ontology analysis with fgsea.
I have two groups, control and mutant, and I ran DESeq2 for differential gene expression analysis. I'm not sure what the right way to choose the background genes is.
For example, should I use only genes with TPM above 1?
Genes whose raw counts summed across all samples are > 10?
All genes from the DESeq2 analysis (regardless of significance)?
Which of these is the appropriate choice?
Generally speaking, for whole-transcriptome analyses you should use the entire transcriptome as the background/universe gene set, and the DEGs as your test set. You can adjust the background if you need to address a specific question, but that doesn't sound like the case with your experimental design.
If you adjust your background gene set without good justification, a reviewer will likely ask you to rerun the analysis when it comes to publication. I've requested this when I've peer-reviewed papers.
I'll argue that it should be the entire expressed transcriptome in most cases for reasons laid out in this answer.
Generally, I use all genes that are capable of being compared by your favorite DE toolkit, e.g. all genes with an adjusted p-value in DESeq2 output that weren't removed by independent filtering.
If you adjust your background gene set without good justification, a reviewer will likely ask you to rerun the analysis when it comes to publication. I've requested this when I've peer-reviewed papers.
This is funny, because I'd be requesting it be re-run if it wasn't adjusted.
OK, thank you so much.
So say I have RNA-seq from blood, for example.
I have the raw counts and I'm running a differential gene expression analysis,
and I want to use the results in fgsea.
Should I filter the gene sets I'm testing for enrichment so that they contain only genes that were in the DESeq2 results, regardless of significance?
(Do those genes indicate which genes are expressed in the samples?)
For fgsea you need to rank your DE analysis results by some metric such as logFC or the t-statistic, so naturally you use the entire DE output list for the analysis. It asks whether gene sets show a general shift towards over- or underexpression in rank space, so it is suitable for underpowered studies and general statements. Overrepresentation analysis, which all answers so far assumed, asks whether a specific set of genes is overrepresented among the DE genes. For this, your background must be all genes eligible to be called DE. For DESeq2, for example, that would be the genes surviving the independent filtering after results(). For edgeR/limma it could be the genes kept after filterByExpr(). It must be only the genes that could be called DE: if, e.g. for technical reasons, a gene is not detected in RNA-seq, there is no point adding it to the background, so IMO the entire transcriptome is inappropriate.
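To make the background argument concrete, here is a minimal Python sketch of the one-sided hypergeometric test that overrepresentation analysis typically relies on. The function name `ora_pvalue` and all the counts are hypothetical, chosen only for illustration. It is not the code of any particular enrichment package. It simply shows that the same overlap gets a different p-value depending on which background is used.

```python
from math import comb


def ora_pvalue(n_background, n_in_set, n_de, n_overlap):
    """Upper-tail hypergeometric p-value, P(overlap >= n_overlap).

    n_background: number of genes in the background/universe
    n_in_set:     gene-set members that are present in the background
    n_de:         number of genes called DE (the test set)
    n_overlap:    DE genes that are also members of the gene set
    """
    total = comb(n_background, n_de)
    upper = min(n_in_set, n_de)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 500 DE genes, 10 of which fall in the gene set.
# Whole annotated transcriptome as background (assumed 20000 genes, 100 in the set).
p_whole = ora_pvalue(20000, 100, 500, 10)
# Only genes that could be called DE as background (assumed 12000 genes, 80 in the set).
p_tested = ora_pvalue(12000, 80, 500, 10)

print(f"whole transcriptome background: {p_whole:.2e}")
print(f"tested-genes background:        {p_tested:.2e}")
```

With these made-up counts the whole-transcriptome background yields the smaller p-value, because undetected genes inflate the universe and make the same overlap look more surprising.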
I think the background should be the gene set you used to run the differential expression analysis in DESeq2, i.e. the genes remaining after any filtering you did before the DE analysis. Alternatively, take the genes in the DESeq2 results without filtering by direction or significance: it should include all genes, whether up, down, or non-significant.
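The distinction drawn in this thread can be sketched in a few lines of Python. The table below is entirely hypothetical: gene names, statistics and adjusted p-values are made up, and `None` stands in for the NA that DESeq2 reports for a gene removed by independent filtering. The 0.05 cut-off is likewise just an assumption for the example.

```python
# Hypothetical DE results: (gene, test statistic, adjusted p-value).
# padj of None mimics an NA adjusted p-value (gene dropped by independent filtering).
de_results = [
    ("GENE_A", 4.2, 0.001),
    ("GENE_B", -3.1, 0.02),
    ("GENE_C", 0.3, 0.80),
    ("GENE_D", 0.1, None),
    ("GENE_E", -0.4, 0.75),
]

# Rank-based enrichment (fgsea-style input): keep every gene that has a
# statistic and sort from most up to most down, with no significance cut-off.
ranked = sorted(
    ((gene, stat) for gene, stat, _ in de_results if stat is not None),
    key=lambda pair: pair[1],
    reverse=True,
)

# Overrepresentation analysis: the background is every gene that could have
# been called DE, i.e. every gene that received an adjusted p-value.
background = {gene for gene, _, padj in de_results if padj is not None}

# The test set is the subset of the background that passes the assumed cut-off.
test_set = {gene for gene, _, padj in de_results if padj is not None and padj < 0.05}

print([gene for gene, _ in ranked])
print(sorted(background))
print(sorted(test_set))
```

The ranked list carries all genes with a statistic, while the ORA background and test set are both drawn only from genes that were actually tested.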