I have recently been going through the concepts required to remove batch effects from gene expression data. In particular, I was working through this tutorial by Jeff Leek and trying out the code shown there. From what I understand, SVA is similar to PCA in that it tries to find components (surrogate variables, here two of them) that explain unknown sources of variation in the data, such as batch effects, in an unsupervised fashion. However, I wanted to clarify one plot from the tutorial. First, they derive the surrogate variables using the following code.
library(sva)
mod = model.matrix(~cancer, data=pheno)   # full model: the variable of interest (cancer status)
mod0 = model.matrix(~1, data=pheno)       # null model: intercept only
sva1 = sva(edata, mod, mod0, n.sv=2)      # estimate two surrogate variables from the expression matrix
Then they check whether any of the surrogate variables correlate with the batch variable:
summary(lm(sva1$sv ~ pheno$batch))   # regresses each surrogate variable on batch (responses Y1 and Y2)
They get the following results
## Response Y1 :
##
## Call:
## lm(formula = Y1 ~ pheno$batch)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26953 -0.11076 0.00787 0.10399 0.19069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.018470 0.038694 -0.477 0.635
## pheno$batch 0.006051 0.011253 0.538 0.593
##
## Residual standard error: 0.1345 on 55 degrees of freedom
## Multiple R-squared: 0.00523, Adjusted R-squared: -0.01286
## F-statistic: 0.2891 on 1 and 55 DF, p-value: 0.5929
##
##
## Response Y2 :
##
## Call:
## lm(formula = Y2 ~ pheno$batch)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23973 -0.07467 -0.02157 0.08116 0.25629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.121112 0.034157 3.546 0.000808 ***
## pheno$batch -0.039675 0.009933 -3.994 0.000194 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1187 on 55 degrees of freedom
## Multiple R-squared: 0.2248, Adjusted R-squared: 0.2107
## F-statistic: 15.95 on 1 and 55 DF, p-value: 0.0001945
It is quite clear that the second surrogate variable is the interesting one, since its regression on batch has a significant p-value (about 2e-04) while the first does not. Next, they draw a boxplot with the second surrogate variable on the y-axis and batch on the x-axis.
boxplot(sva1$sv[,2] ~ pheno$batch)
This is where I am stuck. I am unable to interpret this plot.
Thanks a lot for your explanation; the concept is much clearer to me now. Basically, SVA finds the set of covariates that best explains the variability in the dataset due to the batch effect, and the boxplot shows the inferred covariate (sv[,2] in this case) against the actual batch labels. Could you also please let me know of a way to plot the expression values before and after removing the batch effect?
That last question is very relevant: when you use SVA, or any linear model with covariates, the model may internally modify the expression values on which the statistical tests are performed quite substantially, so they are no longer the same values as the initial measurements. It is generally recommended not to correct your values and then test; rather, you should supply the un-adjusted values, specify the covariates (surrogate variables, batch, etc.) in the model, and test that. Nonetheless, as you say, you may want to plot the adjusted values and compare them with the initial ones. This answer describes a function to do exactly that (and the same reasoning applies to any covariate: in the end you are simply "regressing out" covariates from the data, be they surrogate variables or anything else). I believe it comes from this paper, which may also help clarify the concepts.
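If you just want a quick way to produce such a before/after plot without that function, something along these lines works too (only a sketch, assuming edata, pheno, mod and sva1 are the objects from your code above; limma's removeBatchEffect accepts the surrogate variables as covariates). Keep in mind the adjusted matrix is for visualization only, not for testing.

library(limma)

# remove the variation captured by the surrogate variables, while protecting
# the effect of interest (cancer status) given in the design matrix
edata_adj = removeBatchEffect(edata, covariates = sva1$sv, design = mod)

# compare one gene before and after adjustment, split by batch
par(mfrow = c(1, 2))
boxplot(edata[1, ] ~ pheno$batch, xlab = "batch", main = "unadjusted")
boxplot(edata_adj[1, ] ~ pheno$batch, xlab = "batch", main = "SVs regressed out")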
Understood. So basically I wanted to see the before-and-after plots and check how one can use sva to discover the unwanted variation in the data. However, to obtain the batch-corrected data I am using ComBat and taking the cleaned expression matrix it outputs. Hope that's OK.
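Concretely, what I am doing is roughly the following (just a sketch of my ComBat step; edata and pheno are the same objects as above, and ComBat ships with the sva package):

library(sva)

batch = pheno$batch
modcombat = model.matrix(~1, data=pheno)   # covariates to protect (none here beyond the intercept)
combat_edata = ComBat(dat=edata, batch=batch, mod=modcombat,
                      par.prior=TRUE, prior.plots=FALSE)   # cleaned expression matrix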
For the purposes of exploratory analysis that is OK, but I would not adjust the data and then use it for subsequent testing. Discussions like this one may also be helpful.
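To make the recommended alternative explicit, here is a minimal sketch of testing on the unadjusted values with the surrogate variables included in the design, using limma (object names are the ones from above; the coefficient index assumes cancer is the second column of mod, and the f.pvalue function from the sva package would be another option):

library(limma)

modSv = cbind(mod, sva1$sv)                          # design: intercept + cancer status + surrogate variables
colnames(modSv) = c(colnames(mod), paste0("SV", 1:2))
fit = lmFit(edata, modSv)                            # fit on the *unadjusted* expression values
fit = eBayes(fit)
topTable(fit, coef = 2)                              # coefficient 2 is the cancer term from mod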
Thanks a lot for your help and for providing the links. It's just that I have come across quite a few papers that adjusted for batch effects using ComBat and then used the corrected data for finding DEGs. But now that I have learned about the perils of going down that road, I am pretty sure I won't be repeating the same mistake.