I've been looking into a SARS-CoV-2 dataset, GSE152075. To put simply, they took RNA from infection positive and infection negative people. My aim is to look at the gene expression of a dozen genes I'm interested in to see if it is different based on infection status.
To find the DEG between positive vs negative using DESeq2, the authors used the design, sequencing_batch + infection_status. See summary below for batch numbers and infection status splits.
Batches; A: 19, B: 15, C: 17, D: 14, E: 23, F: 34, G: 16, H: 21, I: 15, J: 26, K: 10, L: 43, M: 19, N: 13, O: 28, P: 54, Q: 57, R: 14, S: 20, T: 3, U: 16
Infection status; negative = 54, positive = 413
I don't particularly see any significant clustering of the groups except for blue/purple being somewhat separated from pink on PC1 and brown colors (negative infection) from the main group (positive infection) on PC2. Is the correct interpretation that that the batch variable is very weakly linked to variance in the dataset?
Intuitively, to me having 21 batches that are uneven is probably going to make finding any differences significantly harder. However, it is also intuitive that batch effects should be considered. If my interpretation is true, is this a justification for me to exclude sequencing batch from the design?
Or should I be using the LRT function to check the difference in adj. pvals between a design of 'sequence_batch + infection_status' vs 'infection_status' for my genes of interest and deciding based on that?