Is there any journal article to reference removeBatchEffect function from Limma?
1
0
Entering edit mode
21 months ago
ev97 ▴ 20

I am trying to correct my RNA data from 3 sources of variation. I wrote this post asking about my problem and one of the users ( ATpoint) has recommended me to use removeBatchEffect from limma.

However, I cannot find any published information apart from this one to see how it works, in which cases you should use it, comparisons between other methods, etc. In general, more information to reference the reason why I chose this method and not another one.

In my case, I know that I have to do this to be able to adjust by 3 variables...

On the other hand, I would like to confirm if it is possible to keep and not touch the biological difference from my data such as age, sex and group and WBC count (several biological variables). Is there a way to preserve it? because the only mention that I always find is about preserving the treatment information.

The design matrix is used to describe comparisons between the samples, for example treatment effects, that should not be removed. The function (in effect) fits a linear model to the data, including both batches and regular treatments, then removes the component due to the batch effects

Note that my 2 main objectives are:

  • Get adjusted counts for future analyses.
  • Run linear mixed models (with variancePartition package) in order to assess the contribution of several variables (such as age, diabetes, sex, hypertension, cholesterol, etc) to the expression variation of each gene.

Could anybody help me, please?

Thanks very much in advance

batch removebatcheffect limma rna-seq batch-effect • 1.4k views
ADD COMMENT
0
Entering edit mode

Thanks very much for your replies, I really appreciate all the information that you provided. Thanks

ADD REPLY
5
Entering edit mode
21 months ago

There is no citation for remoteBatchEffects, mostly because it is so simple.

Here is the complete code:

function (x, batch = NULL, batch2 = NULL, covariates = NULL, 
    design = matrix(1, ncol(x), 1), ...) 
{
    if (is.null(batch) && is.null(batch2) && is.null(covariates)) 
        return(as.matrix(x))
    if (!is.null(batch)) {
        batch <- as.factor(batch)
        contrasts(batch) <- contr.sum(levels(batch))
        batch <- model.matrix(~batch)[, -1, drop = FALSE]
    }
    if (!is.null(batch2)) {
        batch2 <- as.factor(batch2)
        contrasts(batch2) <- contr.sum(levels(batch2))
        batch2 <- model.matrix(~batch2)[, -1, drop = FALSE]
    }
    if (!is.null(covariates)) 
        covariates <- as.matrix(covariates)
    X.batch <- cbind(batch, batch2, covariates)
    fit <- lmFit(x, cbind(design, X.batch), ...)
    beta <- fit$coefficients[, -(1:ncol(design)), drop = FALSE]
    beta[is.na(beta)] <- 0
    as.matrix(x) - beta %*% t(X.batch)
}

In the simplest case of a single batch, and removing the lines to catch edge cases and errors, this can be simplified to:

function (x, batch = NULL, design = matrix(1, ncol(x), 1)) 
{
    batch <- model.matrix(~batch)[, -1, drop = FALSE]
    fit <- lmFit(x, cbind(design, batch))
    beta <- fit$coefficients[, -(1:ncol(design)), drop = FALSE]
    as.matrix(x) - beta %*% t(batch)
}

If you had 8 samples, two conditions (c(0,0,0,0,1,1,1,1)) and two batches c(1,1,0,0,1,1,0,0), then this would simply furhter to:

design <- model.matrix(~condition + batch)
fit <- lmFit(x, design)
corrected <- as.matrix(x) - fit$coefficents[,3] %*% t(batch)

The code basically fits a GLM which has all the terms in your design matrix, plus new ones added for the batches and the covariates. Then it predicts your Y (counts) using only the fitted coefficients, returning the residuals. Its such a common and common sense statistical approach, that I'm not sure anyone would bother to publish on it.

By the way, if you want to correct for more than two factors at once, you just need to use the "covariates" argument to pass the design matrix of all three of your nuisance variables. .

ADD COMMENT
2
Entering edit mode

Just noting that your simplified code will give different results (different by an additive constant for each gene) because the contrast step has been omitted.

ADD REPLY
0
Entering edit mode

Sorry, yes, thT is true, but I think it illustrates the concept still.

ADD REPLY

Login before adding your answer.

Traffic: 1690 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6