Hello everyone,
I’m doing differential expression analysis using DESeq2 of 6 heart cells(iPSC-CM) that were either treated with iron or not (treatment). The issue is that these 6 samples were taken from 3 different individuals and I am not interested in differences between individuals(sample), but the effect of iron on these heart cells. From the PCA plot, however, it seems that the samples are not separated by the intervention(FE vs AA), but rather by the sample type(https://ibb.co/fvH0xmh) despite that I have controlled for the samples in my design(see code below). From my simple understanding of DESeq2, the code I wrote indicates that the effect of treatment will be measured correcting for the sample type and that's what I want. Even if I switch the design, I still get the same clustering pattern. Am I missing something? I don't think it's because of the data because another study using the same dataset found clustering based on treatment and not individuals.
Thank you very much in advance for your help!
The code
#Overview of the count data
df_io[,1:6]
X273.AA X273.FE X15.AA X15.FE X2.AA X2.FE
DDX11L1 0 0 0 0 0 0
WASH7P 46 56 64 79 69 68
MIR6859-1 2 0 2 0 2 2
MIR1302-2 0 0 0 0 0 5
FAM138A 0 0 0 0 0 0
OR4G4P 0 0 0 0 0 0
Making my metadata
treatment<-as.factor(c("AA","FE","AA",
"FE","AA","FE")) # AA= control, FE= intervention
cell_type<- as.factor(c("HiPSC273","HiPSC273","HiPSC15",
"HiPSC15","HiPSC2","HiPSC2")) # indicating the sample type(from which individual it came from)
metadata<-data.frame(treatment,cell_type)
# matching the columnames of count data to rownames of metadata
rownames(metadata)<-colnames(df_io)
#### checking ####
all(rownames(metadata)==colnames(df_io)) # was true
Now making the model and extracting normalized counts to make PCA plot
df_deseq2_ijzer<-DESeqDataSetFromMatrix(countData = df_io,
colData = metadata,
design = ~cell_type+treatment)
dd_normalizated<-estimateSizeFactors(df_deseq2_ijzer2)
normalized_counts_ijzer<-counts(dd_normalizated,normalized=T)
qc_log_ijzer2<-varianceStabilizingTransformation(dd_normalizated,blind = T)
Making the PCA plot
plotPCA(qc_log_ijzer2,intgroup=c("treatment","cell_type"))+theme_bw()
Thank you very much for your help! I really appreciate it. I did what you said and I got a better PCA plot https://ibb.co/f4zKYDD. However, the RNA-Seq dataset I am using was already used for publication(https://www.sciencedirect.com/science/article/pii/S2211124720308676#bib11) and the PCA plot they reported is different than the one I got (https://ibb.co/RY7HG3s) with also different differentially expressed genes. This makes me wonder whether removing batch effects is the same as adjusting for them since removing the batch effects using limma resulted in a different PCA plot.
Getting the vst or rlog normalized counts using your formula will not output counts that are controlled for batch effect. However, the regression model for differential expression will have batch correctly controlled for. What @ATpoint is showing you how to do is make a PCA plot where batch effect is controlled for, for in appearance it is closer to what your regression model looks like.
Thank you very much for your reply. I think I got what you mean. However, how would you explain the discrepancies between the PCA plot I produced following ATpoint suggestion and the PCA plot produced from the original paper, knowing that they also used DESeq2 for RNA-seq analysis? I was expecting a similar PCA plot and a similar number of DEGs, but that was not the case.
You need the exact code they used and you need to know which data they used for the PCA plus on which genes it is based. Otherwise you cannot make any statements on why it is dissimilar.