Question

Which is the best approach to get DEG across one condition, in a three-factor design experiment?

2

Entering edit mode

3.8 years ago

antonioggsousa 3.2k

Hi,

I have a study with three factors: genotype (4 different levels), treatment (2 different levels) and time (time-related condition that could be reduced to a dummy variable: 0 or 1).

I'm interested in obtaining the differentially expressed genes between the treatment, let's say "trt vs. ctrl", for each genotype, at the same time.

So, my first approach to this question was to do pairwise comparisons between "trt vs. ctrl", for each genotype, at the same time.

The code that I used:

ctr_col <- "ctrl"
trt_col <- "trt"

## subset the data frame for the columns of interest
sub_df <- df[,grep(paste0(ctr_col, "|", trt_col), colnames(df))]

## order columns - 1st CTR samples/replicates ... then, TRT samples/replicates
ctr_cols <- grep(ctr_col, colnames(sub_df)) # select the CTR replicates first
trt_cols <- grep(trt_col, colnames(sub_df)) # select the TRT replicates then
order_cols <- c(ctr_cols, trt_cols) # join CTR-TRT
sub_df <- sub_df[,order_cols] # subset data frame  

## import data for EdgeR
sub_df_edgeR <- DGEList(counts = sub_df, 
                       genes = df[,1, drop = FALSE]) # import to EdgeR

keep <- filterByExpr(sub_df_edgeR)
sub_df_edgeR <- sub_df_edgeR[keep, , keep.lib.sizes=FALSE]

## normalize      
norm_sub_df_edgeR <- calcNormFactors(sub_df_edgeR) # calculate the normalizing factors

## factor group variable indicating the order of your columns and respective group variable
test_condition <- factor(c(rep("CTR", 3), rep("TRT", 3))) 
design <- model.matrix(~ test_condition) # model by 'test_condition'
rownames(design) <- colnames(norm_sub_df_edgeR) # get the correspondence between 'test_condition' and 'colnames()'

## calculate dispersion
disp_norm_sub_df_edgeR <- estimateDisp(norm_sub_df_edgeR, 
                                       design, robust=TRUE)

## fit quasi-likelihood (QL) F-test and test TRT vs. CTR
fitGLM <- glmQLFit(disp_norm_sub_df_edgeR, design) # fit QL F-Test
TRTvsCTR_testGLM <- glmQLFTest(fitGLM, coef=2) # test TRT vs. CTR

## get the result, ie, logFC, FDR and so on...
TRTvsCTR_FDR <- topTags(TRTvsCTR_testGLM, n = nrow(sub_df_edgeR$counts))
## save only the information for genes with a 'FDR<0.05' 
TRTvsCTR_FDR_sign <- TRTvsCTR_FDR$table[TRTvsCTR_FDR$table$FDR<0.05,] # sub set the df to get only the genes with a FDR<0.05

The first two lines (ctr_col <- "ctrl" and trt_col <- "trt") are just an example. There I specify the pattern for the the level that I want for the treatment level and for the reference level sample names.

Apart from the filterByExpr() code line, where I'm filtering lowly expressed genes across all samples that belong to the pairwise comparison made rather than by group, I think the code looks ok (I have 3 replicates for each genotype, condition and time). Since I'm comparing the same genotype, at the same time, with the difference relying on the treatment, I think that the differentially expressed genes that I'm getting are the down- or up-regulated genes specific to the treatment variable, for the genotype and time being compared.

I have 3 questions:

(1) Is my last sentence correct? I.e., the DEG that I'm obtaining is due to the different treatment, for a specific genotype and time;
(2) Should I go for a 3-multiple factor design (~ genotype + time + treatment) and, then, retrieve the pairwise comparisons of interest using the contrast option in glmQLFTest(fit, contrast=my.contrasts[,"contrast_condition"])?
(3) Both, (1) and (2), give what I want, but (2) is better or more statistically correct?

Thank you for any help, advice or links to better posts (I checked some at Biostars),

António

RNA-Seq edgeR • 1.3k views

ADD COMMENT • link updated 3.8 years ago by Devon Ryan 104k • written 3.8 years ago by antonioggsousa 3.2k

score 5 · Accepted Answer · 2020-08-05

As long as you're subsetting to include a single combination of genotype and time then yes, you're just getting the DE genes at that time/genotype combination. Whether that's really useful is up to you.
If you want so many pairwise comparisons you're better off with a design like ~0+group and then having groups like genotype1time1treatment1, genotype1time1treatment2, etc. (use shorter more meaningful names of course!). Having all of the samples around when doing the fit will allow better variance estimation.
See above, though I don't think either is strictly more correct, it's just a matter of what gives you a better variance estimate.