Question

RNA-seq, EdgeR and batch effects

0

Entering edit mode

10.5 years ago

Ekarl2 ▴ 120

My experimental design consists of two experimental factors, four conditions (+/+, -/+, +/-, -/-) and four biological replicate per condition. The batch effect in this experiment comes from the fact that the first replicate for all conditions was taken separately, the second separately and so on.

Now I have written a somewhat bloated R script that tries to accomplish separate two goals:

(1) find genes that are differentially expressed between presence and absence of the two experimental factors separately, i.e. which are differentially expressed due to factor 1 alone or factor 2 alone.

(2) get TMM-normalized FPKM values where the batch effect has been removed.

To accomplish (1), I have loaded my matrix of raw counts, rounded it, defined the experimental factors and batch effect as factors, filtered by total counts, created a DGEList object, carried out TMM-normalization, made a design matrix with experimental factors + batch effect (~Batch+Factor1+Factor2), estimated dispersions with estimateGLMCommonDisp(), estimateGLMTrendedDisp() and estimateGLMTagwiseDisp(), fitted with glmFit() and then used glmLRT() for statistical significance testing twice, one for each experimental factor (with the appropriate coefficients based on the design matrix, e.g. coef=5 if 6 is the column for the experimental factor I want to test, where the first column is just the row number).

... (Intercept) batch2 batch3 batch4 Factor1 Factor2

1 1 0 0 0 0 0

# Estimating average dispersion across all genes with GLMs y <- estimateGLMCommonDisp(y, design, verbose=TRUE)

# Estimation of genewise dispersions
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

# Fitting genewise glms
fit <- glmFit(y, design)

# likelihood test
lrt.factor1<- glmLRT(fit, coef=5)
lrt.factor2 <- glmLRT(fit, coef=6)

To get (2), I used a slightly cumbersome method. I went back to the raw counts, log2 transformed then, ran the removeBatchEffect() function from limma specifying the same design matrix (that includes batch effect) as in (1), removed the log2 transform and then ran a Trinity script that is suppose to take raw counts and spit out TMM-normalized FPKM values. The reason I did this was because of conceptual simplicity and because it could potentially do the FPKM step by just running an existing script.

I have a few questions

Should the design matrix used for the limma function removeBatchEffect() not include the batch effect and only the experimental factors that you want to "keep", i.e. (~factor1+factor2)? The manual says "The design matrix is used to describe comparisons between the samples, for example treatment effects, which should not be removed."
Does (2) make sense apart from the design matrix issue? Or is there a smarter way to get batch-corrected values from the edgeR procedure in (1)? If one can get batch-corrected TMM-normalized counts from it, is it possible to use cpm() and then divide by length of feature to get TMM-normalized FPKM values? Does the order of batch correction and TMM-normalization matter?
Is it correct to use coef=5 if Factor1 is the column for the experimental factor I want to test (see the partial design matrix above)? Or did I inadvertently test differential expression between batches?
Is it correct to use coef instead of contrast? The bottom part of my design matrix looks like this:

attr(,"contrasts") attr(,"contrasts")$batch [1] "contr.treatment"

attr(,"contrasts")$factor1
[1] "contr.treatment"

attr(,"contrasts")$factor2
[1] "contr.treatment"

It looks like the correct comparison(s) are already built-in to the design matrix?

RNA-Seq removeBatchEffect edgeR batch-effect • 8.9k views

ADD COMMENT • link updated 14 months ago by Ram 45k • written 10.5 years ago by Ekarl2 ▴ 120

0

Entering edit mode

Why are you rounding counts once you input them? That suggests that they're not integers to begin with and that you should be using limma/voom rather than edgeR.

ADD REPLY • link 10.5 years ago by Devon Ryan 105k

1

Entering edit mode

They are non-integers because RSEM handles multi-mapping reads by assigning a fraction of the read based on proportion of uniquely mapping reads and I round them because not doing so produces an error message in a later step and the Trinity pipeline that uses qCML instead of GLMs does it.

The EdgeR user manual says that "edgeR is not designed to work with estimated expression levels, for example as might be output by Cuinks. edgeR can work with expected counts as output by RSEM, but raw counts are still preferred". This suggests that it is acceptable to use edgeR with this data?

ADD REPLY • link 10.5 years ago by Ekarl2 ▴ 120

1

Entering edit mode

I'd personally prefer to stick with limma in cases like this (I know Gordon Smyth has recommended that in the past, though it could well be that his recommendation has evolved since I last looked, so if the edgeR manual explicitly gives its OK then I wouldn't worry about the validity of the results).

ADD REPLY • link 10.5 years ago by Devon Ryan 105k

0

Entering edit mode

After I get this to work, I will be exploring voom(). Thank you for your suggestions.

ADD REPLY • link 10.5 years ago by Ekarl2 ▴ 120

score 1 · Answer 1 · 2014-12-28

i. The design matrix supplied to removeBatchEffects() should not contain the batches, that's what the batches= parameter is for.

ii. I don't recall there being a way within edgeR to remove batch effects, though I don't pretend to be an expert in it (post over on the bioconductor support site and you'll get a definitive answer from Gordon Smyth). But anyway, I would probably just stick to limma, rather than going back and forth between packages. Within edgeR, there's actually an FPKM (or maybe RPKM, I don't recall which) function that's just a tweaked version of cpm(). I'd use that if I were to stick to edgeR rather than rolling my own function (why reinvent the wheel?). Regarding the order of operations, realistically I suspect that you could get somewhat different results depending on the order, due to floating point operations if nothing else. Honestly, the difference should be quite small, so since you need to normalize before doing the batch correction anyway, I'd just go with that order (the alternative is much more work for highly questionable gain).

iii. That appears correct. Coef=1 is the intercept and they're sequential after that.

iv. It depends on the question you want to ask. For the ones you've described, just using coefficients is simplest.