I have RNA-seq datasets composed of multiple treatments, each with its own batch of controls. All samples come from the same parent cell line, so I believe I should be able to use the controls from other treatments to increase my control n and make the treatment vs control DE analysis more robust. The treatments are CRISPR-introduced mutations in the parent cell line, and I have checked by SNP calling that every control sample is wild type (wt). What is a good way to additionally check that all these controls can indeed be grouped together, besides exploratory PCA (checking that treatment vs control separates on PC1 rather than the different control batches)? How about doing a DGE analysis on the controls only and checking that the most variable genes there are not the genes identified in the treatment vs control analysis? Any other checks?
When doing differential expression, you could include batch as an additional variable in your regression model: ~ condition + batch. The model then controls for the batch effect and also lets you test which genes are differentially expressed between batches.
The problem with running DE tests of the control samples against each other is that there will undoubtedly be SOME differences between them. But how many is too many? If it's only 3 genes, is that too many? What if it's 30? Or 300?
Probably the best solution here is, instead of using the pooled controls for a series of separate DE analyses, to encode the whole thing as a single model with one factor for treatment and one factor for "batch", where a batch refers to a pair of treatment and control. So for two treatments with 2 reps each, your design table might look like:
Sample  Treatment  Batch
1       Control    batch1
2       Control    batch1
3       Treat1     batch1
4       Treat1     batch1
5       Control    batch2
6       Control    batch2
7       Treat2     batch2
8       Treat2     batch2
And the design formula would be ~ 0 + Batch + Treatment. The model will then attempt to correct for any pair-specific differences (that is, effects that are shared between the control and treatment samples of a particular pair, but not with other pairs).
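To make the design concrete, here is a minimal sketch in plain Python that writes out the model matrix implied by ~ 0 + Batch + Treatment for the eight samples in the table above and checks that it is full rank. The column order, the choice of Control as the reference level, and the helper names (design_matrix, matrix_rank) are assumptions made for illustration; whatever DE package you use will build its own matrix from the formula.

```python
from fractions import Fraction

# Sample sheet from the table above: (sample, treatment, batch).
samples = [
    (1, "Control", "batch1"),
    (2, "Control", "batch1"),
    (3, "Treat1", "batch1"),
    (4, "Treat1", "batch1"),
    (5, "Control", "batch2"),
    (6, "Control", "batch2"),
    (7, "Treat2", "batch2"),
    (8, "Treat2", "batch2"),
]

# Assumed column layout for ~ 0 + Batch + Treatment: one indicator per batch,
# plus one per non-reference treatment level (Control is taken as the reference).
columns = ["batch1", "batch2", "Treat1", "Treat2"]


def design_matrix(samples):
    """Build 0/1 indicator rows in the order given by `columns`."""
    rows = []
    for _, treatment, batch in samples:
        rows.append([
            int(batch == "batch1"),
            int(batch == "batch2"),
            int(treatment == "Treat1"),
            int(treatment == "Treat2"),
        ])
    return rows


def matrix_rank(rows):
    """Rank via exact Gaussian elimination on fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


X = design_matrix(samples)
full_rank = matrix_rank(X) == len(columns)
```

If full_rank is True, every coefficient in the model is estimable. Note that in this particular layout there are four groups and four coefficients, so each treatment coefficient works out to the difference between that treatment and the controls of its own batch.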
I think this is the eventual goal: to conduct signature discovery using treatments vs controls while correcting for batches. But the question is whether to do a pre-analysis step that asks whether lumping the controls together makes sense, or whether it creates a signal beyond the treatment vs control signal (ideally it doesn't).
Since you are adding batch to the regression formula, you can use a contrast to test which genes change between certain batches. That will probably be one of the better indicators of batch effect.
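As a rough illustration of what such a contrast computes, the sketch below fits an ordinary least-squares model for a single gene using the eight-sample layout from the earlier table, then takes the difference between the two batch coefficients. The expression values are made up, and the helper names (solve, ols) are hypothetical. A real DE package would fit a count-based model and attach a test statistic to the contrast; this only shows the estimate itself.

```python
from fractions import Fraction

# Assumed design matrix for ~ 0 + Batch + Treatment on the eight samples.
# Column order: batch1, batch2, Treat1, Treat2 (Control is the reference).
X = [
    [1, 0, 0, 0], [1, 0, 0, 0],  # batch1 controls
    [1, 0, 1, 0], [1, 0, 1, 0],  # batch1 Treat1
    [0, 1, 0, 0], [0, 1, 0, 0],  # batch2 controls
    [0, 1, 0, 1], [0, 1, 0, 1],  # batch2 Treat2
]

# Hypothetical log2 expression values for one gene (made up for illustration).
y = [5, 7, 9, 11, 8, 10, 8, 12]


def solve(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


beta = ols(X, y)  # coefficients in the order batch1, batch2, Treat1, Treat2

# Contrast "batch2 minus batch1": the difference in baseline (control) level.
contrast = [-1, 1, 0, 0]
batch_effect = sum(c * b for c, b in zip(contrast, beta))
```

With these made-up numbers the batch coefficients equal the control means of each batch, so batch_effect is simply the gap between the two sets of controls for this gene. Genes where that gap is large are the ones driving the batch effect.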