I have a question about filtering genes detected in an RNA-Seq experiment for subsequent enrichment analysis.
I know it's common to use a p-value threshold in combination with a fold-change cutoff to define the gene set. However, in my data I tend to find a lot of genes with quite high fold-changes but low overall expression, and conversely quite a few DEGs with modest fold-changes but high expression levels.
My question is whether it is common to use absolute expression levels (e.g. group-wise maximum counts) as a criterion for defining gene lists for enrichment/pathway analyses, prioritising highly expressed genes over lowly expressed ones. Or, if it's not commonly done, is there a good reason not to?
This would look something like the following:
1) define a p-value and log2 fold-change cutoff (e.g. p < 0.01, |log2 fold-change| > 0.5)
2) rank the genes which pass according to the maximum observed expression in any condition
3) take the top X genes (500, perhaps) and run enrichment on these
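In code, the scheme I have in mind would look roughly like the sketch below. The column names (`gene`, `padj`, `log2fc`, `max_expr`) and the default thresholds are just placeholders for illustration, not from any particular pipeline:

```python
def select_genes(results, padj_cutoff=0.01, lfc_cutoff=0.5, top_n=500):
    """Filter by significance and fold change, then rank by expression.

    `results` is a list of dicts with keys 'gene', 'padj', 'log2fc',
    and 'max_expr' (maximum observed expression in any condition).
    """
    # Step 1: apply the p-value and |log2 fold-change| cutoffs
    passing = [r for r in results
               if r["padj"] < padj_cutoff and abs(r["log2fc"]) > lfc_cutoff]
    # Step 2: rank the survivors by maximum observed expression
    passing.sort(key=lambda r: r["max_expr"], reverse=True)
    # Step 3: keep the top N most highly expressed genes
    return [r["gene"] for r in passing[:top_n]]
```

So a significant gene with high expression would be kept ahead of an equally significant gene with a bigger fold-change but much lower counts.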
My reasoning is that genes with low expression are less likely to be relevant to the function of my cells than moderately or highly expressed genes, and their low read counts also make them more susceptible to noise, so they are more likely to be spurious hits that don't contain useful information about transcriptional/pathway-level changes. For instance, if gene A has 30,000 counts in the control and 60,000 counts in the test condition, whereas gene B has 30 counts in the control and 120 counts in the treated group, I'm more likely to believe that gene A represents a real and relevant change than gene B, even though the fold-change for gene B is double that of gene A.
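A quick toy simulation makes the counting-noise part of this intuition concrete: the same true fold-change is estimated much more noisily from ~30 counts than from ~30,000. This is only a sketch, using a normal approximation to Poisson sampling noise (counts drawn as Normal(λ, √λ)) and ignoring biological replicates and overdispersion, which a real DE model would handle:

```python
import math
import random
import statistics

def log2fc_sd(mean_ctrl, true_fc=2.0, n_sims=5000, seed=0):
    """Spread (SD) of the estimated log2 fold change at a given count depth.

    Counts are approximated as Normal(lambda, sqrt(lambda)) as a stand-in
    for Poisson noise; values are clamped at 1 to keep the log defined.
    """
    rng = random.Random(seed)
    mean_trt = mean_ctrl * true_fc
    estimates = []
    for _ in range(n_sims):
        ctrl = max(rng.gauss(mean_ctrl, math.sqrt(mean_ctrl)), 1.0)
        trt = max(rng.gauss(mean_trt, math.sqrt(mean_trt)), 1.0)
        estimates.append(math.log2(trt / ctrl))
    return statistics.stdev(estimates)

# The observed log2 fold-change wobbles far more around the truth for a
# gene-B-like count depth (~30) than a gene-A-like depth (~30,000).
print(log2fc_sd(30))     # large spread
print(log2fc_sd(30000))  # small spread
```

On my assumptions the low-count estimate is an order of magnitude noisier, which is exactly why I distrust gene B's 4-fold change more than gene A's 2-fold change.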
Does this make sense statistically, or am I setting myself up for a biased or otherwise flawed analysis by doing this?