Question: Batch effect removal with the help of sva
1
4 weeks ago by
banerjeeshayantan190 wrote:

I was recently going through the concepts required to remove batch effects from gene expression data. In particular, I was going through this tutorial by Jeff Leek and trying out the codes mentioned there. From what I have understood from the analysis, SVA is similar to PCA and tries to find two components that try to explain any unknown variations in the data (or batch effects) in an unsupervised fashion. However, I wanted to clarify one plot from the tutorial. First, they derive the sva components using the following code.

``````mod = model.matrix(~cancer,data=pheno)
mod0 = model.matrix(~1, data=pheno)
sva1 = sva(edata,mod,mod0,n.sv=2)
``````

Then they see whether any of the sva components correlate with the batch effects

``````summary(lm(sva1\$sv ~ pheno\$batch))
``````

They get the following results

``````## Response Y1 :
##
## Call:
## lm(formula = Y1 ~ pheno\$batch)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.26953 -0.11076  0.00787  0.10399  0.19069
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.018470   0.038694  -0.477    0.635
## pheno\$batch  0.006051   0.011253   0.538    0.593
##
## Residual standard error: 0.1345 on 55 degrees of freedom
## Multiple R-squared:  0.00523,    Adjusted R-squared:  -0.01286
## F-statistic: 0.2891 on 1 and 55 DF,  p-value: 0.5929
##
##
## Response Y2 :
##
## Call:
## lm(formula = Y2 ~ pheno\$batch)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.23973 -0.07467 -0.02157  0.08116  0.25629
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.121112   0.034157   3.546 0.000808 ***
## pheno\$batch -0.039675   0.009933  -3.994 0.000194 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1187 on 55 degrees of freedom
## Multiple R-squared:  0.2248, Adjusted R-squared:  0.2107
## F-statistic: 15.95 on 1 and 55 DF,  p-value: 0.0001945
``````

It is quite clear that the second component is interesting because of the fit having a significant p value. Now they draw a boxplot with the second component as the y axis and the batch effects as the x axis.

``````boxplot(sva1\$sv[,2] ~ pheno\$batch)
``````

This is where I am stuck. I am unable to interpret this plot.

modified 4 weeks ago by Papyrus550 • written 4 weeks ago by banerjeeshayantan190
4
4 weeks ago by
Papyrus550
Papyrus550 wrote:

When you perform SVA, the method estimates surrogate variables which explain variance in your data (in the case of SVA, only that variance which is not explained by your main comparison).

The surrogate variables (SVs) that are thus estimated, included in the `svobj` object, have 1 value for each sample. Thus, to check if the SVs are associated to the batch variable, a linear model is fit (`lm`). For the second SV, the coefficient tells you that, changing the batch, the value of the SV changes (by `-0.039675`), and this is significant, so it provides proof that the SV is associated to the batch. It is the equivalent of doing an ANOVA and checking if the values of the SV are different between the batches: i.e. if the means between the boxplots are different, which is what is plotted in the figure: the values of the SV, for each sample, segregated into the batch groups.

Additionally, I would add, because you mention that:

It is quite clear that the second component is interesting because of the fit having a significant p value

Be careful with this: SVA estimates surrogate variables which explain your data. The importance of these surrogate variables particularly depends on how much variance they explain, not on whether they happen to correlate with one of the variables which you have measured in your experiment (such as the batch). What I mean is that, if you had not measured the batch effect, that SV would still be true and important, so I would not say that the importance on the SV relies on its correlation to other measured variables: SVA will find covariates explaining your data whether you happen to know them or not (and that's the nice thing about the method).

Thanks a lot for your explanation. The concept is more clear to me now. Basically, find the set of covariates that best explains the variability in the dataset due to batch effect and produce a boxplot of the inferred covariate (sv[,2]) in this case vs the actual effect. Could you also please let me know a way to plot expression values before and after batch effects?

1

That last question is very relevant: when you use SVA or any linear model construction with covariates, this may, internally, substantially modify the expression values for which you perform the statistical tests, so that they are not really the same values as the initial measurements. It is generally recommended to not correct your values and then test, rather, you should input the un-adjusted values and specify the covariates (SVAs, etc.) in the model and test that. Nonetheless, as you say, you may want to plot the adjusted values to compare to the initial values. This answer describes a function to do so (and the same reasoning could be applied to any covariate, in the end you are simply "regressing out" covariates from the data, be they SVs or anything else). I believe it comes from this paper which may also help clarify concepts.

Understood. So basically I wanted to see the before and after plots and check how one can use sva to discover the unwanted variations in your data. However, for obtaining the batch corrected data, I am using Combat and obtaining the cleaned expression matrix outputted by the same. Hope that's ok.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by banerjeeshayantan190
1

For the purpose of exploratory analysis, it's OK. But I would not adjust data for posterior testing. Discussions like this one may also be helpful.

1

Thanks a lot for your help and for providing the links. It's just that I have gone through quite a few papers that have adjusted for batch effects using Combat and used the corrected data for finding DEGs. But after learning about the apparent perils of going down that road, I am pretty sure I won't be repeating the same mistakes.