I ran a likelihood ratio test on a three-condition comparison and got a very large number of significantly differentially expressed genes (>8,000). When I performed over-representation analysis on all DE genes, no GO terms were significantly over-represented. I'm assuming this is because the significant list is about half of the background (all genes tested for differential expression). Would it be statistically incorrect to subset the top results (say, the top 1,000 DE genes by adjusted p-value) and perform over-representation analysis on that subset? It seems wrong to take only a portion of the significant results, but as a student with limited statistical knowledge I wanted to check.
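For reference, here is a minimal sketch of the upper-tail hypergeometric test that over-representation analysis is commonly based on, written in plain Python. The function names and all numbers (a 16,000-gene background, a 200-gene GO term, the overlaps) are made up for illustration and are not taken from the data above. One thing it makes concrete: when the DE list is half of the background, the fold enrichment of any term cannot exceed 2.

```python
from math import comb


def ora_pvalue(background, de_list, term_size, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    X is the number of genes in a term of size `term_size` that fall in a
    DE list of size `de_list`, out of `background` tested genes.
    """
    total = comb(background, term_size)
    upper = min(term_size, de_list)
    tail = sum(
        comb(de_list, k) * comb(background - de_list, term_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def fold_enrichment(background, de_list, term_size, overlap):
    """Observed overlap divided by the overlap expected by chance."""
    expected = term_size * de_list / background
    return overlap / expected


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the calculation.
    background, term_size = 16000, 200
    for de_list, overlap in [(8000, 115), (8000, 130), (1000, 20), (1000, 30)]:
        p = ora_pvalue(background, de_list, term_size, overlap)
        fe = fold_enrichment(background, de_list, term_size, overlap)
        print(f"DE list={de_list:5d} overlap={overlap:3d} fold={fe:.2f} p={p:.3g}")
```

This only shows how the p-value is computed for a single term. Real ORA tools also correct for testing many terms, so the raw values here are not directly comparable to adjusted results.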
Thank you for the insight. My issue is that the likelihood ratio test I used only assigns an adjusted p-value that can be used to filter for significance (there is no logFC associated with the genes as it is a multi-group test). Even if I make the adjusted p-value ridiculously low, there is still a very large number of differentially expressed genes.
Is there a specific reason you have to use an LRT? Regardless, you can still use glmTreat with a glmFit model and get a modified LRT against the threshold in edgeR. See the glmTreat details for more info.