My experimental design consists of two experimental factors, four conditions (+/+, -/+, +/-, -/-) and four biological replicate per condition. The batch effect in this experiment comes from the fact that the first replicate for all conditions was taken separately, the second separately and so on.
Now I have written a somewhat bloated R script that tries to accomplish separate two goals:
(1) find genes that are differentially expressed between presence and absence of the two experimental factors separately, i.e. which are differentially expressed due to factor 1 alone or factor 2 alone.
(2) get TMM-normalized FPKM values where the batch effect has been removed.
To accomplish (1), I have loaded my matrix of raw counts, rounded it, defined the experimental factors and batch effect as factors, filtered by total counts, created a DGEList object, carried out TMM-normalization, made a design matrix with experimental factors + batch effect (~Batch+Factor1+Factor2), estimated dispersions with estimateGLMCommonDisp(), estimateGLMTrendedDisp() and estimateGLMTagwiseDisp(), fitted with glmFit() and then used glmLRT() for statistical significance testing twice, one for each experimental factor (with the appropriate coefficients based on the design matrix, e.g. coef=5 if 6 is the column for the experimental factor I want to test, where the first column is just the row number).
... (Intercept) batch2 batch3 batch4 Factor1 Factor2 1 1 0 0 0 0 0
# Estimating average dispersion across all genes with GLMs y <- estimateGLMCommonDisp(y, design, verbose=TRUE) # Estimation of genewise dispersions y <- estimateGLMTrendedDisp(y, design) y <- estimateGLMTagwiseDisp(y, design) # Fitting genewise glms fit <- glmFit(y, design) # likelihood test lrt.factor1<- glmLRT(fit, coef=5) lrt.factor2 <- glmLRT(fit, coef=6)
To get (2), I used a slightly cumbersome method. I went back to the raw counts, log2 transformed then, ran the removeBatchEffect() function from limma specifying the same design matrix (that includes batch effect) as in (1), removed the log2 transform and then ran a Trinity script that is suppose to take raw counts and spit out TMM-normalized FPKM values. The reason I did this was because of conceptual simplicity and because it could potentially do the FPKM step by just running an existing script.
I have a few questions
i. Should the design matrix used for the limma function removeBatchEffect() not include the batch effect and only the experimental factors that you want to "keep", i.e. (~factor1+factor2)? The manual says "The design matrix is used to describe comparisons between the samples, for example treatment effects, which should not be removed."
ii. Does (2) make sense apart from the design matrix issue? Or is there a smarter way to get batch-corrected values from the edgeR procedure in (1)? If one can get batch-corrected TMM-normalized counts from it, is it possible to use cpm() and then divide by length of feature to get TMM-normalized FPKM values? Does the order of batch correction and TMM-normalization matter?
iii. Is it correct to use coef=5 if Factor1 is the column for the experimental factor I want to test (see the partial design matrix above)? Or did I inadvertently test differential expression between batches?
iv. Is it correct to use coef instead of contrast? The bottom part of my design matrix looks like this:
attr(,"contrasts") attr(,"contrasts")$batch  "contr.treatment" attr(,"contrasts")$factor1  "contr.treatment" attr(,"contrasts")$factor2  "contr.treatment"
It looks like the correct comparison(s) are already built-in to the design matrix?