Hi,
I have a study with three factors: genotype (4 different levels), treatment (2 different levels) and time (time-related condition that could be reduced to a dummy variable: 0 or 1).
I'm interested in obtaining the differentially expressed genes between the treatment, let's say "trt vs. ctrl", for each genotype, at the same time.
So, my first approach to this question was to do pairwise comparisons between "trt vs. ctrl", for each genotype, at the same time.
The code that I used:
ctr_col <- "ctrl"
trt_col <- "trt"
## subset the data frame for the columns of interest
sub_df <- df[,grep(paste0(ctr_col, "|", trt_col), colnames(df))]
## order columns - 1st CTR samples/replicates ... then, TRT samples/replicates
ctr_cols <- grep(ctr_col, colnames(sub_df)) # select the CTR replicates first
trt_cols <- grep(trt_col, colnames(sub_df)) # select the TRT replicates then
order_cols <- c(ctr_cols, trt_cols) # join CTR-TRT
sub_df <- sub_df[,order_cols] # subset data frame
## import data for EdgeR
sub_df_edgeR <- DGEList(counts = sub_df,
genes = df[,1, drop = FALSE]) # import to EdgeR
keep <- filterByExpr(sub_df_edgeR)
sub_df_edgeR <- sub_df_edgeR[keep, , keep.lib.sizes=FALSE]
## normalize
norm_sub_df_edgeR <- calcNormFactors(sub_df_edgeR) # calculate the normalizing factors
## factor group variable indicating the order of your columns and respective group variable
test_condition <- factor(c(rep("CTR", 3), rep("TRT", 3)))
design <- model.matrix(~ test_condition) # model by 'test_condition'
rownames(design) <- colnames(norm_sub_df_edgeR) # get the correspondence between 'test_condition' and 'colnames()'
## calculate dispersion
disp_norm_sub_df_edgeR <- estimateDisp(norm_sub_df_edgeR,
design, robust=TRUE)
## fit quasi-likelihood (QL) F-test and test TRT vs. CTR
fitGLM <- glmQLFit(disp_norm_sub_df_edgeR, design) # fit QL F-Test
TRTvsCTR_testGLM <- glmQLFTest(fitGLM, coef=2) # test TRT vs. CTR
## get the result, ie, logFC, FDR and so on...
TRTvsCTR_FDR <- topTags(TRTvsCTR_testGLM, n = nrow(sub_df_edgeR$counts))
## save only the information for genes with a 'FDR<0.05'
TRTvsCTR_FDR_sign <- TRTvsCTR_FDR$table[TRTvsCTR_FDR$table$FDR<0.05,] # sub set the df to get only the genes with a FDR<0.05
The first two lines (ctr_col <- "ctrl"
and trt_col <- "trt"
) are just an example. There I specify the pattern for the the level that I want for the treatment
level and for the reference
level sample names.
Apart from the filterByExpr()
code line, where I'm filtering lowly expressed genes across all samples that belong to the pairwise comparison made rather than by group, I think the code looks ok (I have 3 replicates for each genotype, condition and time). Since I'm comparing the same genotype, at the same time, with the difference relying on the treatment, I think that the differentially expressed genes that I'm getting are the down- or up-regulated genes specific to the treatment variable, for the genotype and time being compared.
I have 3 questions:
(1) Is my last sentence correct? I.e., the DEG that I'm obtaining is due to the different treatment, for a specific genotype and time;
(2) Should I go for a 3-multiple factor design (
~ genotype + time + treatment
) and, then, retrieve the pairwise comparisons of interest using thecontrast
option inglmQLFTest(fit, contrast=my.contrasts[,"contrast_condition"])
?(3) Both, (1) and (2), give what I want, but (2) is better or more statistically correct?
Thank you for any help, advice or links to better posts (I checked some at Biostars),
António
Thank you Devon for your kind reply!