Question

Can I remove the control in differential expression analysis?

0

Entering edit mode

3.0 years ago

Morgan S. ▴ 90

Hi there,

Essentially, my experimental design is control vs treatment. Cells were sorted based on fluorescence, so there are 4 different "colors" of treated cells, i.e. red, green, green+red, and blue+green+red. I am interested in how the colors differ from one another. And, I have duplicates for all colors and the control.

> colData(dds_no)
  DataFrame with 10 rows and 3 columns
  sample             color          sizeFactor
  <factor>  <factor>            <numeric>
 CTRL_1     control     0.730176128359336
 CTRL_2     control     1.12370593310441
  GFP_1     green       1.62229333835717
  GFP_3     green       0.733077520973604
 GFPRFP_1   greenred    1.24575808750345
 GFPRFP_2   greenred    1.27612350159403
 RFP_2          red         1.57975191196927
 RFP_3          red         0.518878115991023
 TRIPLE_1       rgb         0.833793868046399
 TRIPLE_3       rgb         1.01334467700869

I am wondering if it is appropriate to remove my control from downstream differential expression analysis when I want to only look at variation between treated cells? When I include the control, the PCA shows that the control samples are well separated from all treated cell types.

What I want (and have done) is removed control from the raw counts, built the deseq2 object with ~color as my design, produced results with LRT instead of Wald, and selected significant DEGs by the adj. pvalue. Then from this, I created a heatmap with my samples (still without control) to show the variation they have with certain significant genes and gene sets.

dds_no = DESeqDataSetFromMatrix(countData=countData_no,
                         colData=colData_no,
                         design=~color)
dds_LRT = DESeq(dds_no,test="LRT", reduced=~1)
res_LRT <- results(dds_LRT)
rld_LRT<- rlogTransformation(dds_LRT)
pathsLRT<-assay(rld_LRT)
df_pathLRT <- cbind(rownames(res_LRT), data.frame(res_LRT, row.names=NULL))
topTableLRT <- as.data.frame(df_pathLRT)
sigGeneListLRT <- subset(topTableLRT,  padj<=0.05)[,1]
topMatrixLRT <- pathsLRT[which(rownames(pathsLRT) %in% sigGeneListLRT),]
topMatrixPATHSLRT <- gsva(data.matrix(topMatrixLRT),
                   stress_Cao_etal_2017,
                   method="gsva",
                   min.sz=1,
                   max.sz=Inf,
                   kcdf="Gaussian",
                   mx.diff=TRUE,
                   verbose=TRUE)
heat_pathsLRT <- t(scale(t(topMatrixPATHSLRT)))
pheatmap(heat_pathsLRT, annotation_col=dfdds, fontsize_row = 8, cluster_rows = FALSE, cluster_cols =
 TRUE, main="Stress Paths", annotation_legend=FALSE)`

Please let me know if this approach is acceptable or if it truly requires I include the control.

Thanks, Morgan

expression differential deseq2 analysis rnaseq • 1.1k views

ADD COMMENT • link 3.0 years ago by Morgan S. ▴ 90

score 3 · Answer 1 · 2021-08-25

If you are interested in differences between the colours, then yes, this design is ok. Indeed, if you wish to ask "Is the change between green and control different than the change between red and control" you are implicitly ingoring the control in any disign as (control - green) - (control - red) == red - green. However, you may hav difficulties in the down stream interprestation of the results. Let say that your effect size for red - green is 2, than means that expression is 4 times as high in red as it is in green. But this could mean that red upregaultes gene expression 4 times more than green, or that it down regulates gene expression 4 times less than green, or that red up regulates expression two fold and green down regulates expression 2 fold. The control would allow you to distinguish these.

My advice would be to calculate significant genes as you have done, but calculate all your log fold changes compared to control in order interpret the results.