I have RNASeq data that I am analysing using DESeq2. I have 2 genotypes (WT and mutant). The RNASeq was run in 3 seperate batches (Batch1, 2 and 3; all batches containd some WT and mutant samples). I ran DESeq using design=~genotype + batch, releveled to genotype=WT. I then tried including surrogate variables based on the batch using SVA as follows below. I get a different number of DE genes in each analysis.
Whe I run a PCA and sample distances, the samples cluster more by batch than by genotype (particulalrly the 3rd batch where I know the quality of the RNA was poor).
I'm not sure which results I should be actually using. It seems that using SVA is a better way to exclude technical variation. Am I correct? When applying SVA, should the batch be included in the original design, or just the genotype?
I appreciate any help! Thanks in advance!
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ genotype, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
# Number of significant surrogate variables is: 2
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + genotype
ddssva <- DESeq(ddssva)
ressva <- results(ddssva,contrast=c("genotype","HET","WT"))