I have 2 experimental condition and each condition has 18 samples. However, they are sequenced in 3 batches. 1 st batch 6 pair (disease & control), 2nd batch 6 pair (disease & control), 3rd batch 6 pair (disease & control).
Each batch done by different people over different times so I believe we have batch effect. As I am using DESeq2 for my analysis I set my model design as “Batch+Type”. I used removeBatchEffect from limma package and plotted PCA & did hierarchical clustering using spermann correlation (I believe it is better to use Spearman rather than the Pearson??) to see which samples are outliers. Another issue I am facing is 1st & 2nd batches are 76 bp paired end, however the last batch is 101 bp paired end.I did not add this into the design separately.
I would like to do differential gene expression analysis and WGCA. So I have multiple questions:
1- Should I use rlog() for differential gene expression analysis and getVarianceStabilizedData() for WGCNA? or can I use same normalization for both? If yes, which one do you suggest ?
2- Should I remove any samples according to these PCA plots & dendogram or just do analysis as I already included batch effect into the model design as "~Batch+Type". Or is it like for DGE analysis I do not to remove any sample but for WGCNA I should remove ? For both normalization type it looks like I have to remove the Patient-Sph4 as it look has distance from others in PCA plot also (the one at the middletop)
I really appreciate if you give me some insight,
Thanks in advance,
ddsMat.pre<- DESeqDataSetFromMatrix(seMat.pre, sample.table.ERSvsPT.pre, design=~Batch+Type) dds.pre<-DESeq(ddsMat.pre) dds.pre <- dds.pre[ rowSums(counts(dds.pre)) > 1, ] datExpr0<- assay(dds.pre) gsg = goodSamplesGenes(datExpr0, verbose = 5); gsg$allOK
rld <- rlog(dds.pre, blind= FALSE) rlogMat <- assay(rld) save(rld, file="RLD_Pre_Type+Batch") design=model.matrix(~sample.table.ERSvsPT.pre$Type) removedbatch.rld=removeBatchEffect(rlogMat, batch=sample.table.ERSvsPT.pre$Batch, design= design ) write.csv(removedbatch.rld, file="removedbatch.rld_36patient") rld.removed<-rld assay(rld.removed)<-removedbatch.rld par(mfrow = c(1,2)); cex1 = 1.5; plotPCA(rld.batch, intgroup=c("Type", "Batch")) plotPCA(rld.removed, intgroup=c("Type", "Batch")) colData(rld.removed)$Type<-as.factor(colData(rld.removed)$Type) colData(rld.removed)$Batch<-as.factor(colData(rld.removed)$Batch) data.removed<-plotPCA(rld.removed, intgroup=c("Batch","Type"), returnData=TRUE) data.removed$name<-colData(rld.removed)$Patient percentVar.removed <- round(100 * attr(data.removed, "percentVar")) p2<-ggplot(data.removed, aes(PC1, PC2, color=Batch, shape=Type, label= name)) + geom_text(size=5,hjust = 0, nudge_x = 0.05) + geom_point(size=3) + xlab(paste0("PC1: ",percentVar.removed,"% variance")) + ylab(paste0("PC2: ",percentVar.removed,"% variance")) + ggtitle("PCA of RLD_pre_Type+Batch ") + coord_fixed() mat.rld.removed<- t(assay(rld.removed)) rownames(mat.rld.removed) <- paste( rld.removed$Written.ID) plot(hcluster(dist(mat.rld.removed), method= "spearman", link="average" ), cex=1.5, cex.main=2, main= "Hierarchical Clustering of RLD Normalized Count Data ") dev.copy(png, "Sample clustering (hcluster) of RLD_Pre_Type+Batch.png") dev.off()
vsd = getVarianceStabilizedData(dds.pre) save(vsd, file="VSD_Pre_Type+Batch") design=model.matrix(~sample.table.ERSvsPT.pre$Type) vsd.removed=removeBatchEffect(vsd, batch=sample.table.ERSvsPT.pre$Batch, design= design ) colnames(vsd.removed) <- rld.removed$Written.ID plot(hcluster(dist(t(vsd.removed)), method= "spearman", link="average" ), cex=1.5, cex.main=2, main= "Hierarchical Clustering of VSD Normalized Count Data ") dev.copy(png, "Sample clustering (hcluster) of vsd_Pre_Type+Batch.png") dev.off()