I'm analysing at the moment gene expression data from aquatic insects, which were exposed to an insecticide in 3 different concentrations, so in total 4 treatments groups (control, low conc., medium conc., high conc.). The design is completely balanced, each group has 9 replicates.
When I run a PCA on vst-transformed counts, the plot looks like this:
It seems that the first axis explains my treatment to a certain degree... When I did the differential expression analysis with DESeq, the number of genes detected as DE somehow fitted my expectations, since the highest treatment induced most gene regulations, followed by medium and low treatment.
This is the relevant code snippet; pesticide was a factor with 4 levels
dds <- DESeqDataSetFromMatrix(countData=rawCounts, colData=metedata, design= ~ Pesticide) dds <- DESeq(dds)
Since there was certainly something else going on, I wanted to see if/how many SVs I could estimate and how the data (i.e. the PCA bi-plot) would look like if the SVs were regressed out. So I used the SVA package, which estimated 5 SVs. I included them subsequently and plotted always the PCA biplot:
What I am wondering about is that when I included all singificant SVs (5) as covariates in DESeq and perform the analysis again, then the high concentration treatment, which is still "the farest away from the control" on the first axis in the corrected PCA plot, shows the lowest number of DE genes.
This is how implemented the SVA results into DESeq
ddssva <- dds ddssva$SV1 <- svseq$sv[,1] ddssva$SV2 <- svseq$sv[,2] ddssva$SV3 <- svseq$sv[,3] ddssva$SV4 <- svseq$sv[,4] ddssva$SV5 <- svseq$sv[,5] design(ddssva) <- ~ SV1+SV2+SV3+SV4+SV5+Pesticide ddssva <- DESeq(ddssva)
I read many posts here, and most of them relate to the issue that batch correction/SVA introduces signal. I also read that in some cases true variation is lost, what appears to be the case here, but I thought this should be also reflected in the PCA then...? I first thought this would be related to high within-group variability but when I I fit the model with a numerical predictor (i.e. concentration measurements, scaled and centered), many many genes are detected as DE - with and without including the SVs as covariates in the design.
Any thoughts or recommendations how I could further explore this or hints towards relevant posts would be highly appreciated.
Thanks and all the best!