We recently had to run a differential expression analysis on RNA-seq data from ~20 tumors against several controls. Our final goal was to extract the main differences between our cases and controls. While running DESeq2 on these samples, we observed relatively high variation between cases (which was expected), while the controls were quite similar to each other.
The analysis revealed ~3000 genes differentially expressed between our two conditions. However, when we looked at the read counts of these differentially expressed genes, we saw that only a subset of the samples were expressing them. In other words, for most differentially expressed genes, only a subset of our cases is driving the signal. We are afraid that we are uncovering differentially expressed genes that cannot be qualified as "cases vs controls" but are mainly due to inter-case variability. The issue here seems to lie in the case group, which is too heterogeneous (time of diagnosis, drugs, ...). Unfortunately, it is not possible to regenerate a more homogeneous dataset.
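To make the "only a subset of cases drives the signal" observation concrete, here is a minimal Python sketch of the kind of per-gene check I mean. It is an illustration only, not our actual pipeline: the function name `fraction_of_cases_driving`, the 2-fold `margin`, and the counts are all made up, and the rule (a case "drives" the signal if its normalized count falls outside the control range widened by `margin`) is just one simple assumption among many possible ones.

```python
def fraction_of_cases_driving(case_counts, control_counts, margin=2.0):
    """Fraction of case samples whose count falls outside the control range.

    The control range is widened by `margin`-fold on both sides
    (an arbitrary, illustrative choice).
    """
    upper = max(control_counts) * margin
    lower = min(control_counts) / margin
    outside = [c for c in case_counts if c > upper or c < lower]
    return len(outside) / len(case_counts)


# Made-up normalized counts for two hypothetical genes.
genes = {
    # Only two of six cases are far from the controls.
    "geneA": {"cases": [900, 850, 20, 25, 18, 22], "controls": [20, 24, 19]},
    # All six cases are far from the controls.
    "geneB": {"cases": [210, 190, 230, 205, 220, 198], "controls": [20, 24, 19]},
}

for name, g in genes.items():
    frac = fraction_of_cases_driving(g["cases"], g["controls"])
    print(f"{name}: {frac:.2f} of cases outside the control range")
```

In this toy data, geneA would be the kind of gene we are worried about (a minority of cases carries the difference), whereas geneB looks like a genuine "cases vs controls" difference.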
Our next two approaches will be:
- Batch effect correction.
- Rerun the analysis several times, each time with a different subset of the case samples, and compare the results.
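The subsampling idea could be sketched roughly as below, in plain Python. This is a sketch under stated assumptions, not a DESeq2 workflow: `call_de` is a hypothetical stand-in (a crude mean fold-change threshold) for whatever real test is used, and `subsample_stability`, the subset size, the number of rounds, and the counts are all illustrative. The point is only the bookkeeping: draw random subsets of cases, rerun the call, and record how often each gene is called.

```python
import random
from collections import Counter


def call_de(case_counts, control_counts, min_fold=2.0):
    """Hypothetical stand-in for a real DE test: mean fold change vs. a threshold."""
    case_mean = sum(case_counts) / len(case_counts)
    control_mean = sum(control_counts) / len(control_counts)
    # Pseudocount of 1 avoids division by zero.
    fold = (case_mean + 1) / (control_mean + 1)
    return fold >= min_fold or fold <= 1 / min_fold


def subsample_stability(cases, controls, subset_size, n_rounds=200, seed=0):
    """Per gene, the fraction of random case subsets in which it is called DE.

    cases / controls: dict mapping gene -> list of counts, one per sample,
    with the same sample order for every gene.
    """
    rng = random.Random(seed)
    n_cases = len(next(iter(cases.values())))
    hits = Counter()
    for _ in range(n_rounds):
        # Pick the same random case samples for every gene in this round.
        idx = rng.sample(range(n_cases), subset_size)
        for gene, counts in cases.items():
            subset = [counts[i] for i in idx]
            if call_de(subset, controls[gene]):
                hits[gene] += 1
    return {gene: hits[gene] / n_rounds for gene in cases}


# Made-up counts: geneA is driven by two cases, geneB by all of them.
cases = {
    "geneA": [900, 850, 20, 25, 18, 22],
    "geneB": [210, 190, 230, 205, 220, 198],
}
controls = {"geneA": [20, 24, 19], "geneB": [20, 24, 19]}

stability = subsample_stability(cases, controls, subset_size=3)
for gene, frac in stability.items():
    print(f"{gene}: called in {frac:.2f} of subsets")
```

Genes that are called in nearly every subset would be candidates for a robust "cases vs controls" signal, while genes whose call depends on which cases were drawn would point to inter-case variability.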
Does anyone have a comment or a solution (if one exists) for extracting the main signal without it being dominated by the inter-sample variability?
Thank you in advance!