Question: removing batch effects using ComBat and SVA
3
gravatar for LJ
3.2 years ago by
LJ180
LJ180 wrote:

Dear all,

I have a problem in processing my microarrays data.And I know how the batch effect is structured,soI would like to remove batch effect using ComBat. But I also want to adjust for unknown sources of noise using sva. Is there a way of combining these two methods? And is it feasible that i run ComBat first, get the ComBat-adjusted matrix, and run sva on this? Is it bad practice to run a second adjustment on data which has already been adjusted by a different package. Thanks in advance.

batch effect combat sva R • 11k views
ADD COMMENTlink modified 6 months ago by Agustin Gonzalez-Vicente40 • written 3.2 years ago by LJ180

Hello, i am checking SVA for estimating batches in my expression atlas. this expression atlas is composed my multiple experiments (different dates) and multiple tissues analysed. Contrarily to the general aim of this analysis, i am not interested in performing a DE analysis, but i am interested in normalizing this data for a following a network analysis. For this reason i'll consider the two methods:

(1) SVA : identifying and estimating surrogate variables for unknown sources of variation in high-throughput experiments. (2) ComBat: directly removing known batch effects using ComBat in SVA.

For the 1st method what will be the full model matrix that i have to set?

Hope it is clear, Thanks in advance.

ADD REPLYlink written 2.1 years ago by lessismore660

Hi, I cannot download the bladderdata example and I am having problems with the format of the matrices to run ComBat.

How dat, batch and mod should be formatted?

Thanks.

ADD REPLYlink written 6 months ago by Agustin Gonzalez-Vicente40
5
gravatar for Ar
3.2 years ago by
Ar890
United States
Ar890 wrote:

Ideal way would be: Get the batches from sva (if you are using microarray) or svaseq (RNA-Seq) and provide the batch vector to ComBat. ComBat will give you Batch Corrected matrix. Do the PCA plot and check if the samples are clustering based on genotypes or batches.

And is it feasible that i run ComBat first, get the ComBat-adjusted matrix, and run sva on this? For, running the ComBat you need batch information. There is no need to run sva on Batch adjusted matrix. The difference between ComBat and sva apart from their method is that ComBat does Batch Correction however, sva finds the batches. You may do Batch Correction using sva's surrogate variable by doing simple linear algebra. However, ComBat works fine too.

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by Ar890

Thanks for your reply.

(1)My microarrays data was collected in two experiments,and I do the PCA plot, finding the samples are clustering based on the two experiment batches.

So,i just set the batch information as the two experiment batches(1,1,1,...,2,2,2,...),then run the ComBat with the batches.The Batch adjusted matrix was the corrected expression data.And there is no need to run sva for removing the latent factors. Is it right?

(2)Or i run sva for identifying the surrogate variables,instead of using the two experiments batches. And the following code here is the sva package example:

mod = model.matrix(~as.factor(cancer), data=pheno)
 mod0 = model.matrix(~1,data=pheno)
 n.sv = num.sv(edata,mod,method="leek")
 svobj = sva(edata,mod,mod0,n.sv=n.sv)#### apply the sva function to estimate the surrogate variables

What does the variable of interest mean? Does it mean that we correct the data by removing cancer factor effects and batch effects? And how do i creat the full model matrix,cause i do not have other variable of interest like the example.

Looking forward to your reply,thanks in advance.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by LJ180

Yeah, the first one is correct. Check the PCA plot on corrected batch matrix and you may use limma to do DEGs analysis. I think even if you run sva and check the plots, you will see two batches but its worth trying. I didn't understand your question regarding variable of interest.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Ar890
1

Thanks.One more question for you,and the following code is what i used in ComBat:

> pheno
              subgroups batch
    sample1           N     1
    sample2           N     1
    sample3           N     1
    sample4           N     1
    sample5           T     1
    sample6           N     1
    sample7           N     1
    sample8           S     1
    sample9           N     1
    sample10          N     1
    sample11          N     1
    sample12          N     1
    sample13          N     1
    sample14          S     1
    sample15          S     1
    sample16          N     1
    sample17          N     1
    sample18          T     1
    sample19          N     1
    sample20          N     1
    sample21          N     1
    sample22          T     1
    sample23          N     1
    sample24          N     1
    sample25          N     1
    sample26          N     1
    sample27          T     1
    sample28          N     1
    sample29          N     1
    sample30          T     1
    sample31          T     1
    sample32          T     1
    sample33          N     1
    sample34          T     1
    sample35          T     1
    sample36          S     1
    sample37          S     1
    sample38          N     1
    sample39          T     1
    sample40          S     1
    sample41          N     1
    sample42          T     2
    sample43          T     2
    sample44          T     2
    sample45          T     2
    sample46          S     2
    sample47          T     2
    sample48          T     2
    sample49          T     2
    sample50          T     2
    sample51          N     2
    sample52          N     2
    sample53          N     2
    sample54          N     2
    sample55          S     2
    sample56          S     2
    sample57          T     2
    sample58          T     2
    sample59          S     2
    sample60          T     2
    sample61          T     2
    sample62          T     2
    sample63          T     2
    sample64          N     2
    sample65          N     2
    sample66          N     2
    sample67          N     2
    sample68          N     2
    sample69          N     2
    sample70          N     2
    sample71          N     2
    sample72          N     2
    sample73          N     2
    sample74          N     2
    sample75          S     2
    sample76          S     2
    sample77          S     2
    sample78          S     2
    sample79          S     2
    sample80          S     2
    sample81          S     2
    sample82          S     2
    sample83          S     2
    sample84          S     2
    sample85          S     2
    sample86          S     2
    sample87          S     2
    sample88          S     2
    sample89          T     2
    sample90          T     2
    sample91          T     2
    sample92          T     2
    sample93          T     2
    sample94          T     2
    sample95          T     2
    sample96          T     2
    sample97          T     2
    sample98          T     2
    sample99          T     2
    sample100         T     2

batch<-pheno$batch
modcombat<-model.matrix(~1, data=pheno)
combat_mydata= ComBat(dat=mydata, batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)

And if i set

modecombat1<-model.matrix(~subgroups, data=pheno) 
combat_mydata= ComBat(dat=mydata, batch=batch, mod=modcombat1, par.prior=TRUE,prior.plots=FALSE)

so what is the difference between the two codes? And this is what i understand:(1)the former means we suppose the variances are all caused by batches,we correct data by removing batches effects,but at the same time,the variances caused by subgroups are eliminated ;(2)the latter means we think the variances are caused by batches and subgroups,so we correct data with the subgroups variances retained.

So,i should choose modecombat1<-model.matrix(~subgroups, data=pheno),am i right? Thanks.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by LJ180
1

You should do :

batch<-pheno$batch
modcombat<-model.matrix(~1, data=pheno)
combat_mydata= ComBat(dat=mydata, batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)

because you are regressing the batches and trying to removing any of the unwanted variation and not biological variation. In case of

modecombat1<-model.matrix(~subgroups, data=pheno)

you are regressing on subgroups or trying to remove the biological variation from the subgroups. Now, doing that your batch corrected matrix will more-o-less have no variation between them and therefore your Diff expression genes (DEGs) analysis will yield no statistically significant genes. You use subgroups information when you are performing limma / DEGs analysis and not when you are trying to remove the batch effects. The goals of two tasks are different.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Ar890

So the code

modcombat<-model.matrix(~1, data=pheno)
combat_mydata<-ComBat(dat=mydata, batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)

only removes batch effects but subgroups variation retained. However,when i add a covariate level--subgroups,the following code

modcombat<-model.matrix(~subgroups, data=pheno)
combat_mydata<-ComBat(dat=mydata, batch=batch, mod=modcombat1, par.prior=TRUE, prior.plots=FALSE)

not only removes batch effects(that's what i want),but also removes biological variation from the subgroups.'cause i just only want to remove batch effects,so the former should be used? And when adding a covariate level to the mod,it means that you want to remove batch effects and some other variation caused by the covariate level simultaneously?

Many thanks.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by LJ180

Yes, use this code only.

modcombat<-model.matrix(~1, data=pheno)
combat_mydata<-ComBat(dat=mydata, batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)
ADD REPLYlink written 3.2 years ago by Ar890
2

ok,but i just saw a explanation for Covariates in ComBat here from Dr.Johnson (the author of "sva" package),it seems that he gave a opposite meaning of covariates, and the following is the answer to the purpose of including covariates in the mode:

Sorry about the late response. However, this is the most common question I answer on this forum.

As far as including covariates: covariates should ALWAYS be added unless you have a completely balanced factorial design, in which it won't make much difference whether you add them or not. However, if you have an unbalanced proportion of treatment/controls or if some treatments are missing from one of the batches, you MUST include covariates or you risk introducing bias or removing biological signal from the data.

Just to shed so light on why this is true, just think of a simple regression or ANOVA scenario where you have two possibly correlated predictors or factors predicting a response. My favorite example (that I use for my undergraduate students) is the relationship between shark bites (response) and ice cream sales (explanatory). If you fit the regression, you will clearly see a strong correlation between the two variables. However, if you introduce a second explanatory variable, temperature, you will then see no relationship between ice cream and shark bites. Now, if you "adjust" shark bites for ice cream without taking temperature into account, you will remove part of the relationship of interest (shark v. temp) and may not see any correlation between shark bites and temperature at all.

Now completing the analogy, let shark bites be gene expression, ice cream be batch, and temp be the biological condition of interest. I hope it is clear now why its important to include covariates!

Evan

So,i am geting more confused, am i missing something improtant?

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by LJ180

Probably you are right. Thanks for letting me know. With my limited experience in case of Batch Correction using ComBat, I have not seen models including covariates. I could be wrong because in my case, I may have balanced proportion of treatment/controls. You may try both the models and see how your PCA plots look like and see which one is giving you better clusters. However, I would believe the authors more than my experience.

ADD REPLYlink written 3.2 years ago by Ar890
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 794 users visited in the last hour