I have RNA-seq datasets composed of multiple treatments vs multiple control batches (i.e., each treatment has its own control). All samples, however, come from the same parent cell line, so I believe I should be able to use the controls from other treatments in order to expand my n for controls and increase the robustness of the treatment vs control DE analysis. I have checked each control sample and they are wt (by calling SNPs); the treatments here are CRISPR introduction of mutations into the parent cell line. What is a good way to additionally check that all these controls could indeed be grouped together, besides exploratory PCA (checking that treatment vs control is on PC1 and not the different control batches)? How about doing a DE analysis only on the controls and checking that the most variable genes there are not the genes identified in the treatment vs control analysis? Any other checks?
The problem with conducting DE tests of your control samples against each other is that there will undoubtedly be SOME differences between them. But how many is too many? If it's only 3 genes, is that too many? What if it's 30? Or 300?
Probably the best solution here is, rather than using the pooled controls for a series of separate DE analyses, to encode the whole thing as a single model with one factor for treatment and one factor for "batch", where a batch refers to a pair of treatment and control. So for two treatments with 2 reps each, your design table might look like:
Sample  Treatment  Batch
1       Control    batch1
2       Control    batch1
3       Treat1     batch1
4       Treat1     batch1
5       Control    batch2
6       Control    batch2
7       Treat2     batch2
8       Treat2     batch2
And the design formula would be:
~ 0 + Batch + Treatment
The model will then attempt to correct for any pair-specific differences (that is, effects that are shared between the control and treatment of a particular pair, but not with other pairs).
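To make the design above concrete, here is a minimal sketch in plain Python that builds the indicator columns a formula like ~ 0 + Batch + Treatment would imply for the 8-sample table, and checks that the resulting matrix is full rank (so every coefficient is estimable). It assumes Control is the reference level for Treatment. The helper names design_matrix and matrix_rank are made up for this illustration and are not part of any DE package.

```python
from fractions import Fraction

# (Treatment, Batch) for samples 1..8, copied from the design table above.
samples = [
    ("Control", "batch1"), ("Control", "batch1"),
    ("Treat1", "batch1"), ("Treat1", "batch1"),
    ("Control", "batch2"), ("Control", "batch2"),
    ("Treat2", "batch2"), ("Treat2", "batch2"),
]

def design_matrix(samples, reference="Control"):
    """One indicator column per batch (no intercept), plus one per
    non-reference treatment. Hypothetical helper for illustration."""
    batches = sorted({b for _, b in samples})
    treatments = sorted({t for t, _ in samples if t != reference})
    columns = batches + treatments
    rows = []
    for treatment, batch in samples:
        row = [1 if batch == b else 0 for b in batches]
        row += [1 if treatment == t else 0 for t in treatments]
        rows.append(row)
    return columns, rows

def matrix_rank(rows):
    """Rank via Gauss-Jordan elimination on exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

columns, rows = design_matrix(samples)
print(columns)
for row in rows:
    print(row)
print("full rank:", matrix_rank(rows) == len(columns))
```

If the rank came out lower than the number of columns, the batch and treatment effects could not be separated and the design would need rethinking before fitting.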