I have 2 experimental condition and each condition has 18 samples. However, they are sequenced in 3 batches. 1 st batch 6 pair (disease & control), 2nd batch 6 pair (disease & control), 3rd batch 6 pair (disease & control).
Each batch done by different people over different times so I believe we have batch effect. As I am using DESeq2 for my analysis I set my model design as “Batch+Type”. I used removeBatchEffect from limma package and plotted PCA & did hierarchical clustering using spermann correlation to see which samples are outliers.(I attached the graphs below). I also used an alternative PCA &plotting (prcomp() & fviz_pca_ind) which actually confused me. The ggplot & fviz_pca_ind graphs look different from each other. Which one should I trust?
Another issue I am facing is 1st & 2nd batches are 76 bp paired end, however the last batch is 101 bp paired end.I did not add this into the design separately. Should I add it to the model design ? If so, how should I desing ? or Should I trim all fastq files to 76 bp to have same size ? What do you suggest?
I would like to do differential gene expression analysis and WGCA. So Should I remove any samples according to these PCA plots or just do analysis as I already included batch effect into the model design as "~Batch+Type"
I really appreciate if you give me some insight,
Thanks in advance,
ddsMat.pre<- DESeqDataSetFromMatrix(seMat.pre, sample.table.ERSvsPT.pre, design=~Batch+Type)
dds.pre <- dds.pre[ rowSums(counts(dds.pre)) > 1, ]
rld.batch <- rlog(dds.pre, blind= FALSE)
rlogMat.batch <- assay(rld.batch)
Removing batch effect
removedbatch=removeBatchEffect(rlogMat.batch, batch=sample.table.ERSvsPT.pre$Batch, design= design )
Plotting the PCA
before removing batch effect:
data.batch<-plotPCA(rld.batch, intgroup=c("Batch","Type"), returnData=TRUE)
percentVar.batch <- round(100 * attr(data.batch, "percentVar"))
ggplot(data.batch, aes(PC1, PC2, color=Batch, shape=Type, label= name)) + geom_text(size=6,hjust = 0, nudge_x = 0.05) + geom_point(size=2) + xlab(paste0("PC1: ",percentVar.batch,"% variance")) + ylab(paste0("PC2: ",percentVar.batch,"% variance")) + ggtitle("PCA of RLD_Type+Batch ") + coord_fixed()
after removing batch effect:
data.removed<-plotPCA(rld.removed, intgroup=c("Batch","Type"), returnData=TRUE)
percentVar.removed <- round(100 * attr(data.removed, "percentVar"))
ggplot(data.removed, aes(PC1, PC2, color=Batch, shape=Type, label= name)) + geom_text(size=6,hjust = 0, nudge_x = 0.05) + geom_point(size=3) + xlab(paste0("PC1: ",percentVar.removed,"% variance")) + ylab(paste0("PC2: ",percentVar.removed,"% variance")) + ggtitle("PCA after removeBatchEffect ") + coord_fixed()
Alternative PCA method : prcomp() & fviz_pca_ind
before.pca <- prcomp(t(rlogMat.batch), scale=T)
after.pca <- prcomp(t(removedbatch), scale=T)
quali.sup <- as.factor(colData(rld.readlength)$Type)
fviz_pca_ind(before.pca,habillage = quali.sup, addEllipses = TRUE, ellipse.level = 0.75) + theme_minimal()
fviz_pca_ind(after.pca,habillage = quali.sup, addEllipses = TRUE, ellipse.level = 0.68) + theme_minimal()