I have a study with three factors: genotype (4 different levels), treatment (2 different levels) and time (time-related condition that could be reduced to a dummy variable: 0 or 1).
I'm interested in obtaining the differentially expressed genes between the treatment, let's say "trt vs. ctrl", for each genotype, at the same time.
So, my first approach to this question was to do pairwise comparisons between "trt vs. ctrl", for each genotype, at the same time.
The code that I used:
ctr_col <- "ctrl" trt_col <- "trt" ## subset the data frame for the columns of interest sub_df <- df[,grep(paste0(ctr_col, "|", trt_col), colnames(df))] ## order columns - 1st CTR samples/replicates ... then, TRT samples/replicates ctr_cols <- grep(ctr_col, colnames(sub_df)) # select the CTR replicates first trt_cols <- grep(trt_col, colnames(sub_df)) # select the TRT replicates then order_cols <- c(ctr_cols, trt_cols) # join CTR-TRT sub_df <- sub_df[,order_cols] # subset data frame ## import data for EdgeR sub_df_edgeR <- DGEList(counts = sub_df, genes = df[,1, drop = FALSE]) # import to EdgeR keep <- filterByExpr(sub_df_edgeR) sub_df_edgeR <- sub_df_edgeR[keep, , keep.lib.sizes=FALSE] ## normalize norm_sub_df_edgeR <- calcNormFactors(sub_df_edgeR) # calculate the normalizing factors ## factor group variable indicating the order of your columns and respective group variable test_condition <- factor(c(rep("CTR", 3), rep("TRT", 3))) design <- model.matrix(~ test_condition) # model by 'test_condition' rownames(design) <- colnames(norm_sub_df_edgeR) # get the correspondence between 'test_condition' and 'colnames()' ## calculate dispersion disp_norm_sub_df_edgeR <- estimateDisp(norm_sub_df_edgeR, design, robust=TRUE) ## fit quasi-likelihood (QL) F-test and test TRT vs. CTR fitGLM <- glmQLFit(disp_norm_sub_df_edgeR, design) # fit QL F-Test TRTvsCTR_testGLM <- glmQLFTest(fitGLM, coef=2) # test TRT vs. CTR ## get the result, ie, logFC, FDR and so on... TRTvsCTR_FDR <- topTags(TRTvsCTR_testGLM, n = nrow(sub_df_edgeR$counts)) ## save only the information for genes with a 'FDR<0.05' TRTvsCTR_FDR_sign <- TRTvsCTR_FDR$table[TRTvsCTR_FDR$table$FDR<0.05,] # sub set the df to get only the genes with a FDR<0.05
The first two lines (
ctr_col <- "ctrl" and
trt_col <- "trt") are just an example. There I specify the pattern for the the level that I want for the
treatment level and for the
reference level sample names.
Apart from the
filterByExpr() code line, where I'm filtering lowly expressed genes across all samples that belong to the pairwise comparison made rather than by group, I think the code looks ok (I have 3 replicates for each genotype, condition and time). Since I'm comparing the same genotype, at the same time, with the difference relying on the treatment, I think that the differentially expressed genes that I'm getting are the down- or up-regulated genes specific to the treatment variable, for the genotype and time being compared.
I have 3 questions:
(1) Is my last sentence correct? I.e., the DEG that I'm obtaining is due to the different treatment, for a specific genotype and time;
(2) Should I go for a 3-multiple factor design (
~ genotype + time + treatment) and, then, retrieve the pairwise comparisons of interest using the
(3) Both, (1) and (2), give what I want, but (2) is better or more statistically correct?
Thank you for any help, advice or links to better posts (I checked some at Biostars),