I’m doing differential expression analysis using DESeq2 and would appreciate some advice on batch effects. I have one experimental factor with four levels (“condition”: A, B, C, D). In the PCA plot, samples separate by condition along PC1 (~34% of variance). There is a batch effect (two tissue sampling dates), but it only separates samples along PC2; no separation by batch is observed along PC1. I was therefore planning to perform DE with batch as a covariate in the model (~ batch + condition), and then use batch-corrected variance-stabilised counts from limma’s removeBatchEffect() for downstream visualisation such as heatmaps and gene expression boxplots, as documented in the DESeq2 vignette.
mat <- assay(vsd)
mat <- limma::removeBatchEffect(mat, vsd$batch)
However, my problem is that the batches are not evenly distributed among the groups. I realise this is not optimal (group-batch assignments below), but it is the data I have been given. The design is possibly not completely confounded, but condition D is not great: all of its samples are in batch 1. I would rather not discard data if possible. So my question is whether it is valid to perform the DE analysis and generate the batch-corrected counts as I’ve described, given the unbalanced design. Or, since the batch effect is along PC2 rather than PC1, is it less risky to skip batch correction than to batch correct with an unbalanced design (I'm thinking probably not)?
Any advice would be much appreciated, thanks.
condition  batch1  batch2
A          3       2
B          1       4
C          1       4
D          5       0
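
In case it helps, here is a minimal base-R sketch of how I would check that batch and condition are not completely confounded. It rebuilds a sample sheet from the counts in the table above and tests whether the ~ batch + condition design matrix has full column rank. The data frame `coldata` and the sample ordering are my own assumptions for illustration, not my real sample sheet.

```r
# Hypothetical sample sheet rebuilt from the group-batch table (sample order is arbitrary)
coldata <- data.frame(
  condition = factor(rep(c("A", "B", "C", "D"), times = c(5, 5, 5, 5))),
  batch = factor(c(
    rep(c("1", "2"), times = c(3, 2)),  # condition A: 3 in batch 1, 2 in batch 2
    rep(c("1", "2"), times = c(1, 4)),  # condition B: 1 in batch 1, 4 in batch 2
    rep(c("1", "2"), times = c(1, 4)),  # condition C: 1 in batch 1, 4 in batch 2
    rep("1", 5)                         # condition D: all 5 in batch 1
  ), levels = c("1", "2"))
)

# Cross-tabulation of condition by batch; should reproduce the table above
tab <- table(coldata$condition, coldata$batch)
print(tab)

# Design matrix for the additive model ~ batch + condition
design <- model.matrix(~ batch + condition, data = coldata)

# Full column rank means the batch and condition coefficients are all estimable
full_rank <- qr(design)$rank == ncol(design)
print(full_rank)
```

A full-rank design only says the coefficients can be estimated; it says nothing about how precisely the D comparisons are estimated when D has no samples in batch 2.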