Hello,
I have a SingleCellExperiment object with data run on two different chips (one chip has 2 KO and 1 WT, the other 2 WT and 1 KO). I assessed the variance explained by different variables with the function getVarianceExplained(), and I could see that the chip (batch) explained as much variance as interesting biological factors such as genotype.
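For context, the variance-explained check was roughly along these lines (a minimal sketch, assuming `chip` and `genotype` are columns of `colData(sce)` and logcounts are already computed):

```r
library(scater)

# Percentage of per-gene variance explained by each metadata variable;
# assumes logcounts are present and "chip"/"genotype" are colData columns
varexp <- getVarianceExplained(sce, variables = c("chip", "genotype"))

# Density plot of variance explained across genes, one curve per variable
plotExplanatoryVariables(varexp)
```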
I thought I could try a linear batch correction, since the chips are balanced and linear methods are recommended for simple batch corrections. However, after running a linear regression method (I tried both of the ones available in the batchelor package: regressBatches and rescaleBatches), the data plotted in a UMAP or t-SNE looks worse than before. It almost looks as if my before and after plots were swapped!
I am not especially worried for this particular analysis; I will probably just carry on without batch correction. But I was wondering whether results like this are normal/possible, or whether I was violating some requirement or assumption I am not aware of. My maths level is very basic and I am not very familiar with linear regression.
This is the first time I have posted here, so please let me know if I am missing anything or if there is a better place to post this kind of question!
Thank you in advance,
Nadine
Code:
Before this batch correction I had:
# preprocessed the object with QC (low UMI, high mitochondrial content, non-expressed genes, etc.)
# Normalised
quick_clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, clusters = quick_clusters, min.mean = 0.1)
sce <- logNormCounts(sce)
# Feature selection and Dim reduction
# (I also tested with all genes and PCs with same result)
gene_var <- modelGeneVar(sce)
hvgs <- getTopHVGs(gene_var, prop = 0.15)
rowSubset(sce) <- hvgs
# Dim reduction
sce <- runPCA(sce)
reducedDim(sce, "PCA") <- reducedDim(sce, "PCA")[, 1:25]  # keep the first 25 PCs
sce <- runUMAP(sce, dimred = "PCA")
plotReducedDim(sce, colour_by = "chip", dimred = "UMAP", point_size = 0.1) +
    ggtitle("Before batch correction, small dots")
The batch correction was performed with:
set.seed(100)
sce <- correctExperiments(sce,
    batch = factor(sce$chip),
    PARAM = RescaleParam()
)
# Dimensional reduction corrected
sce <- runPCA(sce, name = "PCA_corrected", exprs_values = "corrected")
sce <- runUMAP(sce, name = "UMAP_corrected", dimred = "PCA_corrected")
plotReducedDim(sce, colour_by = "chip", dimred = "UMAP_corrected", point_size = 0.1) +
    ggtitle("After batch correction, small dots")
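For completeness, the regressBatches attempt looked essentially the same; a minimal sketch (assuming scater is loaded, and noting that the returned SCE stores the residuals in its "corrected" assay):

```r
library(batchelor)

set.seed(100)
# Regress out the chip effect from the log-expression values;
# returns a new SCE whose "corrected" assay holds the residuals
regressed <- regressBatches(sce, batch = factor(sce$chip))

# Re-run PCA/UMAP on the corrected values for comparison
regressed <- runPCA(regressed, exprs_values = "corrected")
regressed <- runUMAP(regressed, dimred = "PCA")
```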
Session info
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] batchelor_1.6.2 Matrix_1.3-2
## [3] scran_1.18.5 scater_1.18.6
## [5] ggplot2_3.3.3 SingleCellExperiment_1.12.0
## [7] SummarizedExperiment_1.20.0 Biobase_2.50.0
## [9] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
## [11] IRanges_2.24.1 S4Vectors_0.28.1
## [13] BiocGenerics_0.36.0 MatrixGenerics_1.2.1
## [15] matrixStats_0.58.0 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] viridis_0.5.1 sass_0.3.1
## [3] edgeR_3.32.1 BiocSingular_1.6.0
## [5] jsonlite_1.7.2 viridisLite_0.3.0
## [7] DelayedMatrixStats_1.12.3 scuttle_1.0.4
## [9] bslib_0.2.4 statmod_1.4.35
## [11] highr_0.8 dqrng_0.2.1
## [13] GenomeInfoDbData_1.2.4 vipor_0.4.5
## [15] yaml_2.2.1 pillar_1.5.1
## [17] lattice_0.20-41 glue_1.4.2
## [19] limma_3.46.0 beachmat_2.6.4
## [21] digest_0.6.27 XVector_0.30.0
## [23] colorspace_2.0-0 htmltools_0.5.1.1
## [25] pkgconfig_2.0.3 zlibbioc_1.36.0
## [27] scales_1.1.1 RSpectra_0.16-0
## [29] ResidualMatrix_1.0.0 BiocParallel_1.24.1
## [31] tibble_3.1.0 farver_2.1.0
## [33] ellipsis_0.3.1 withr_2.4.1
## [35] magrittr_2.0.1 crayon_1.4.1
## [37] evaluate_0.14 fansi_0.4.2
## [39] bluster_1.0.0 beeswarm_0.3.1
## [41] tools_4.0.4 lifecycle_1.0.0
## [43] stringr_1.4.0 munsell_0.5.0
## [45] locfit_1.5-9.4 DelayedArray_0.16.2
## [47] irlba_2.3.3 compiler_4.0.4
## [49] jquerylib_0.1.3 rsvd_1.0.3
## [51] rlang_0.4.10 grid_4.0.4
## [53] RCurl_1.98-1.2 BiocNeighbors_1.8.2
## [55] RcppAnnoy_0.0.18 igraph_1.2.6
## [57] bitops_1.0-6 labeling_0.4.2
## [59] rmarkdown_2.7 codetools_0.2-18
## [61] gtable_0.3.0 R6_2.5.0
## [63] gridExtra_2.3 knitr_1.31
## [65] uwot_0.1.10 utf8_1.1.4
## [67] rprojroot_2.0.2 stringi_1.5.3
## [69] ggbeeswarm_0.6.0 Rcpp_1.0.6
## [71] vctrs_0.3.6 xfun_0.21
## [73] sparseMatrixStats_1.2.1
I agree that the batch effect does not look bad here; as I said, I will simply carry on without any correction. I was just interested in understanding why this behaviour happens, or whether it is something to expect for a reason I am missing. I never thought that running a batch correction method could yield a dataset less integrated than before correction.
Regarding the other integration methods, I read here that linear methods are recommended over data integration methods for simple batch correction: "While data integration methods can also be applied to simple batch correction problems, we recommend to be wary of over-correction given the increased degrees of freedom of non-linear data integration approaches. For example, MNN was shown to be outperformed by linear methods such as ComBat in the simpler batch correction setting (Buttner et al, 2019)"
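For comparison, the MNN-based alternative in batchelor would look roughly like this; a sketch, assuming the same `sce` and HVG set (`hvgs`) as in the code above:

```r
library(batchelor)

set.seed(100)
# fastMNN returns a new SCE with a low-dimensional "corrected" reducedDim
mnn_out <- fastMNN(sce, batch = factor(sce$chip), subset.row = hvgs)

# UMAP on the MNN-corrected dimensions
mnn_out <- runUMAP(mnn_out, dimred = "corrected")
```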
Interesting. I've always felt that methods like ComBat apply too heavy a hand in practice; they've often obliterated sample-specific populations in my hands. I'm surprised ComBat was their best performer across multiple datasets, but then again, I suppose the settings they looked at were pretty simple. This study found ComBat to consistently be among the worst performers across 7 datasets (though MNN was middle of the pack as well), preferring methods like LIGER and Harmony. It would be nice if those were less annoying to run on SCE objects.
Thanks for that paper; I really like how it shows very different results depending on the dataset. This proves how, unfortunately, there isn't a single good answer and it really depends on the initial conditions. I think Fabian Theis's group (where my first reference came from) is working on a similar comparison between integration methods, so let's see whether they revise their conclusion from the previous paper that linear methods are better for simple batch correction. Even if, in the end, as you said, we might just end up picking whichever method is better implemented in the workflow we are using ^^'