I have RNA-seq data from the GEO database that has three groups - Control (C), Disease1 (D1), and Disease2 (D2). The patients of D1 and D2 were treated with a drug (T) and data was generated on pre and post-treatment disease groups i.e. I have 5 grouping factors - C, D1-pre, D1-post, D2-pre, D2-post. I am a bit confused while modeling this nested data to identify DE genes in limma. I have two questions in mind:
- What are DE genes in D1 and D2 compared to control, and
- Whether the drug has a differential effect on the expression of DE genes in D1 compared to D2.
I tried to make a model matrix as:
library(limma)
patient <- data.frame(
patient$patient <- factor(paste(rep(1:5, times = 5),"_",patient$diagnosis))
diagnosis = factor(rep(c("D1", "D2","Control"), each=10, length.out=25)),
treat = factor(rep(c("Pre", "Post"), each=5, times=3, length.out=25)))
patient$group <- factor(paste(patient$diagnosis,patient$treat))
design <- model.matrix(~0+group,patient)
colnames(design) <- c("C", "D1post", "D1pre", "D2post", "D2pre")
Since I am not looking for a pair-wise comparison, I am a bit confused with the modeling. I am thinking in lines, something like:
des1 <- design[,c(1:3)]
corfit <- duplicateCorrelation(eset1,des1,block=sheet1$patient)
#eset1 is the subsetted data matrix with expression levels
fit <- lmFit(eset1,des1,block=sheet1$patient,correlation=corfit$consensus)
fit.e <- eBayes(fit, trend=TRUE, robust=TRUE)
topTable(fit, coef=2)
and a similar model for D2.
I am not sure how correct these models will be to identify 'true' DEGs in post-treatment groups for both diseases.
I will be thankful if someone could please help me with modeling.