Question

Consistency of Seurat SCTransform across computers/environments

2

Entering edit mode

3.6 years ago

Bastien Hervé 5.3k

Due to Covid19 situation I am working part time at the lab and at home. At the lab I am working under an HP workstation linux environment and at home on MSI with windows 10 and Rstudio.

As input to the SCTransform function I use the same RDS object, same parameters, same R version, same Seurat version, but I get slightly different output results which lead to different UMAPs.

I have set the same seed.use SCTransform's parameter in both environment.

From the lab :

library(Seurat)
SCRNARR.list <- readRDS(file = "/mnt/raid1/Data/SCRNARR/SCRNARR_BeforeSCT.rds")
object.size(SCRNARR.list)
944243832 bytes
for (sample in 1:length(SCRNARR.list)) {
    SCRNARR.list[[sample]] <- SCTransform(SCRNARR.list[[sample]], seed.use=1447854, variable.features.n = 2000, vars.to.regress = "percent.mt", verbose = FALSE)
}
There were 50 or more warnings (use warnings() to see the first 50)
object.size(SCRNARR.list)
2131213616 bytes

From home :

library(Seurat)
SCRNARR.list <- readRDS(file = "D://KI/VM/Data/SCRNARR/SCRNARR_BeforeSCT.rds")
object.size(SCRNARR.list)
944243832 bytes
for (sample in 1:length(SCRNARR.list)) {
    SCRNARR.list[[sample]] <- SCTransform(SCRNARR.list[[sample]], seed.use=1447854, variable.features.n = 2000, vars.to.regress = "percent.mt", verbose = FALSE)
}
There were 50 or more warnings (use warnings() to see the first 50)
object.size(SCRNARR.list)
2131212976 bytes

I went back up to the first diverging point between the two methods and SCTransform function is the first function outputing different object sizes leading to different UMAP conformation.

This issue does not come neither from the randomness of UMAP creation as at home I always get the same UMAP and in the mean time at the office I get the same UMAPs, but every time UMAPs from the office and UMAPs from home are different.

Could it be related to the computer itself, like how both computers are handling floating point ?

Thanks !

Specifications :

From the lab :

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=sv_SE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=sv_SE.UTF-8    LC_MESSAGES=en_US.UTF-8  
 [7] LC_PAPER=sv_SE.UTF-8       LC_NAME=C                
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=sv_SE.UTF-8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] Seurat_3.2.0

loaded via a namespace (and not attached):
 [1] httr_1.4.2            tidyr_1.1.1           jsonlite_1.7.0      
 [4] viridisLite_0.3.0     splines_4.0.2         leiden_0.3.3        
 [7] shiny_1.5.0           ggrepel_0.8.2         globals_0.12.5      
[10] pillar_1.4.6          lattice_0.20-41       glue_1.4.1          
[13] reticulate_1.16       digest_0.6.25         polyclip_1.10-0      
[16] RColorBrewer_1.1-2    promises_1.1.1        colorspace_1.4-1    
[19] cowplot_1.0.0         htmltools_0.5.0       httpuv_1.5.4        
[22] Matrix_1.2-18         plyr_1.8.6            pkgconfig_2.0.3      
[25] listenv_0.8.0         purrr_0.3.4           xtable_1.8-4        
[28] patchwork_1.0.1       scales_1.1.1          RANN_2.6.1          
[31] tensor_1.5            later_1.1.0.1         Rtsne_0.15          
[34] spatstat.utils_1.17-0 tibble_3.0.3          mgcv_1.8-31          
[37] generics_0.0.2        ggplot2_3.3.2         ellipsis_0.3.1      
[40] ROCR_1.0-11           pbapply_1.4-3         lazyeval_0.2.2      
[43] deldir_0.1-28         survival_3.1-12       magrittr_1.5        
[46] crayon_1.3.4          mime_0.9              future_1.18.0        
[49] nlme_3.1-147          MASS_7.3-51.6         ica_1.0-2            
[52] tools_4.0.2           fitdistrplus_1.1-1    data.table_1.13.0    
[55] lifecycle_0.2.0       stringr_1.4.0         plotly_4.9.2.1      
[58] munsell_0.5.0         cluster_2.1.0         irlba_2.3.3          
[61] compiler_4.0.2        rsvd_1.0.3            rlang_0.4.7          
[64] grid_4.0.2            ggridges_0.5.2        goftest_1.2-2        
[67] RcppAnnoy_0.0.16      rappdirs_0.3.1        htmlwidgets_1.5.1    
[70] igraph_1.2.5          miniUI_0.1.1.1        gtable_0.3.0        
[73] codetools_0.2-16      abind_1.4-5           reshape2_1.4.4      
[76] R6_2.4.1              gridExtra_2.3         zoo_1.8-8            
[79] dplyr_1.0.2           uwot_0.1.8            fastmap_1.0.1        
[82] future.apply_1.6.0    KernSmooth_2.23-17    ape_5.4-1            
[85] spatstat.data_1.4-3   stringi_1.4.6         spatstat_1.64-1      
[88] parallel_4.0.2        Rcpp_1.0.5            rpart_4.1-15        
[91] vctrs_0.3.2           sctransform_0.2.1     png_0.1-7            
[94] tidyselect_1.1.0      lmtest_0.9-37

From home :

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C                   LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] Seurat_3.2.0

loaded via a namespace (and not attached):
 [1] httr_1.4.2            tidyr_1.1.2           jsonlite_1.7.0        viridisLite_0.3.0    
 [5] splines_4.0.2         leiden_0.3.3          shiny_1.5.0           ggrepel_0.8.2        
 [9] globals_0.12.5        pillar_1.4.6          lattice_0.20-41       glue_1.4.2          
[13] reticulate_1.16       digest_0.6.25         polyclip_1.10-0       RColorBrewer_1.1-2  
[17] promises_1.1.1        colorspace_1.4-1      cowplot_1.0.0         htmltools_0.5.0      
[21] httpuv_1.5.4          Matrix_1.2-18         plyr_1.8.6            pkgconfig_2.0.3      
[25] listenv_0.8.0         purrr_0.3.4           xtable_1.8-4          patchwork_1.0.1      
[29] scales_1.1.1          RANN_2.6.1            tensor_1.5            later_1.1.0.1        
[33] Rtsne_0.15            spatstat.utils_1.17-0 tibble_3.0.3          mgcv_1.8-31          
[37] generics_0.0.2        ggplot2_3.3.2         ellipsis_0.3.1        ROCR_1.0-11          
[41] pbapply_1.4-3         lazyeval_0.2.2        deldir_0.1-28         survival_3.1-12      
[45] magrittr_1.5          crayon_1.3.4          mime_0.9              future_1.18.0        
[49] nlme_3.1-148          MASS_7.3-51.6         ica_1.0-2             tools_4.0.2          
[53] fitdistrplus_1.1-1    data.table_1.13.0     lifecycle_0.2.0       stringr_1.4.0        
[57] plotly_4.9.2.1        munsell_0.5.0         cluster_2.1.0         irlba_2.3.3          
[61] compiler_4.0.2        rsvd_1.0.3            rlang_0.4.7           grid_4.0.2          
[65] ggridges_0.5.2        rstudioapi_0.11       goftest_1.2-2         RcppAnnoy_0.0.16    
[69] rappdirs_0.3.1        htmlwidgets_1.5.1     igraph_1.2.5          miniUI_0.1.1.1      
[73] gtable_0.3.0          codetools_0.2-16      abind_1.4-5           reshape2_1.4.4      
[77] R6_2.4.1              gridExtra_2.3         zoo_1.8-8             dplyr_1.0.2          
[81] uwot_0.1.8            fastmap_1.0.1         future.apply_1.6.0    KernSmooth_2.23-17  
[85] ape_5.4-1             spatstat.data_1.4-3   stringi_1.4.6         spatstat_1.64-1      
[89] parallel_4.0.2        Rcpp_1.0.5            rpart_4.1-15          vctrs_0.3.2          
[93] sctransform_0.2.1     png_0.1-7             tidyselect_1.1.0      lmtest_0.9-37

Seurat • 3.3k views

ADD COMMENT • link 3.6 years ago by Bastien Hervé 5.3k

score 3 · Answer 1 · 2020-09-24

After one week on this matter I think it is time to know when to stop digging. Here is where I went so far :

You can notice that both objects before SCTransform are exactly similar, I have outputted 12 digits in this regards and the percent.mt columns are identicals.

I have investigate the SCTransform function end each variable are identicals in both methods before the vst call.

So here I am diving into the vst function trying to catch the moment where one variable is different between both computers, and I have found the reg_model_pars function.

Each given parameters are strictly the same in both methods but the outputted table model_pars_fit has different digits results.

You can see a start of digit unconsistensy in the theta column of the model_pars_fit object where the 14th digit can sometime be different in both conditions.

#From home
Browse[2]> head(model_pars)
                   theta     (Intercept)        log_umi
Gm996   0.32357670929216  -7.27629425513 1.200396174984
Miip    1.36366857498143 -10.81004819372 2.049307632658
Gatm    2.83584888603177  -9.37186401727 3.010610338152
Gm42670 0.04096150125472 -14.57765295644 2.607220585029
Igfbp6  0.00260318017233 -10.27534471280 1.474478130387
Lrrk2   0.03483076688676  -8.36917341648 0.922622854941

Browse[2]> head(model_pars_fit)
                 theta    (Intercept)       log_umi
Xkr4   0.3387578388903 -10.7343273015 2.04951110982
Sox17  0.0448549826806 -11.2091790634 1.65706231561
Mrpl15 2.7704770356570 -10.6195816599 2.45828041253
Lypla1 1.1300709637094 -10.9628137308 2.30163275018
Tcea1  4.6505611537629 -10.5167181216 2.68957605843
Rgs20  0.0639445102836 -10.8614308264 1.71239484860

#From the lab
Browse[2]> head(model_pars)
                   theta     (Intercept)        log_umi
Gm996   0.32357670929215  -7.27629425513 1.200396174984
Miip    1.36366857498148 -10.81004819372 2.049307632658
Gatm    2.83584888603177  -9.37186401727 3.010610338152
Gm42670 0.04096150125472 -14.57765295644 2.607220585029
Igfbp6  0.00260318017233 -10.27534471280 1.474478130387
Lrrk2   0.03483076688676  -8.36917341648 0.922622854941

Browse[2]> head(model_pars_fit)
                 theta    (Intercept)       log_umi
Xkr4   0.3387340069401 -10.7337639999 2.04936611746
Sox17  0.0448925740448 -11.1244329222 1.63384459020
Mrpl15 2.7704551684688 -10.6195337182 2.45826806094
Lypla1 1.1296301798296 -10.9605802435 2.30105731963
Tcea1  4.6505611537629 -10.5167181216 2.68957605843
Rgs20  0.0638077258829 -10.8789609804 1.71695512126

But after the reg_model_pars function the unconsistancy can appears at the tenth.

So once again I wanted to dig into reg_model_pars function. I was able to track the changes to this function.

model_pars_fit[o, 'dispersion_par'] <- ksmooth(x = genes_log_gmean_step1, y = model_pars[, 'dispersion_par'], x.points = x_points, bandwidth = bw, kernel='normal')$y

Which is a call to a C function and where I lost the track of the calculation process...

I am really sad not to be able to fully understand what's going on here but I am more and more convinced by a digit handling issue.

score 2 · Answer 2 · 2020-09-16

2

Entering edit mode

3.6 years ago

ATpoint 82k

What are the warnings?

I would start by checking towards SCtransform:

are the selected highly-variable genes (HVGs) exactly the same between the platforms
are the obtained Pearson residuals from SCtransform the same. I guess would be enough to check on a few samples

If both is yes (and I would expect that) then the issue is downstream. Downstream means PCA and clustering algorithms. I think Seurat uses by default Louvain clustering based on Jaccard's similarity, that is (I think) deterministic but not really sure. UMAP itself is not and will change depending on Seed, but afaik (I do not use Seurat though) there is a constant internal seed for all Seurat functions. Same should go (not sure) for PCA. At least the workflow (Bioconductor, OSCA) I follow always explicitely sets a seed prior to running PCA and clustering.

I mean, it looks like the clustering is basically the same towards the overall picture, just a bit rotated, isn't it? You have 7 clusters each with four of them forming that big Africa-shaped cloud and the other three a bit scattered around. That indicates that not a basic parameter (such as HVGs and residuals) changes but rather something more subtle, e.g. a Seed or so. Try to set explicit seeds for all downstream functions (if possible).

ADD COMMENT • link 3.6 years ago by ATpoint 82k

0

Entering edit mode

Thank you, really interesting thinking ! I dug a bit more into SCTransform and seems like the HVGs are not always the same, see below.

Warnings are the same in both cases, the same line is repeated except for warning 31 :

#Warnings
29: In theta.ml(y = y, mu = fit$fitted) : iteration limit reached
30: In theta.ml(y = y, mu = fit$fitted) : iteration limit reached
31: In sqrt(1/i) : production de NaN
32: In theta.ml(y = y, mu = fit$fitted) : iteration limit reached
33: In theta.ml(y = y, mu = fit$fitted) : iteration limit reached

Checking towards SCtransform

I have 7 different samples in SCRNARR.list

SCRNARR.list.LAB <- readRDS(file = "/mnt/raid1/Data/SCRNARR/SCRNARR_AfterSCT_LAB.rds")
object.size(SCRNARR.list.LAB)
2131213616 bytes

SCRNARR.list.HOME <- readRDS(file = "/mnt/raid1/Data/SCRNARR/SCRNARR_AfterSCT_HOME.rds")
object.size(SCRNARR.list.HOME)
2131212976 bytes

#For LAB
HVGs.LAB = list()
for (sample in 1:length(SCRNARR.list.LAB)) {
    HVGs.LAB[[sample]] <- rownames(SCRNARR.list.LAB[[sample]]@assays$SCT@meta.features[SCRNARR.list.LAB[[sample]]@assays$SCT@meta.features$sct.variable,])
}

mean.list.LAB=list()
for (sample in 1:length(SCRNARR.list.LAB )) {
    mean.list.LAB[[sample]] <- SCRNARR.list.LAB[[sample]]@assays$SCT@meta.features$sct.residual_mean
}

variance.list.LAB=list()
for (sample in 1:length(SCRNARR.list.LAB)) {
    variance.list.LAB[[sample]] <- SCRNARR.list.LAB[[sample]]@assays$SCT@meta.features$sct.residual_variance
}

#For HOME
HVGs.HOME = list()
for (sample in 1:length(SCRNARR.list.HOME)) {
    HVGs.HOME[[sample]] <- rownames(SCRNARR.list.HOME[[sample]]@assays$SCT@meta.features[SCRNARR.list.HOME[[sample]]@assays$SCT@meta.features$sct.variable,])
}

mean.list.HOME=list()
for (sample in 1:length(SCRNARR.list.HOME)) {
    mean.list.HOME[[sample]] <- SCRNARR.list.HOME[[sample]]@assays$SCT@meta.features$sct.residual_mean
}

variance.list.HOME=list()
for (sample in 1:length(SCRNARR.list.HOME)) {
    variance.list.HOME[[sample]] <- SCRNARR.list.HOME[[sample]]@assays$SCT@meta.features$sct.residual_variance
}

#Present in HOME but not in LAB
for (sample in 1:length(HVGs.HOME)) {
    print(setdiff(HVGs.HOME[[sample]], HVGs.LAB[[sample]]))
}

character(0)
character(0)
[1] "Sntg1"   "Tgfbr3l" "Apoa1"  
character(0)
character(0)
character(0)
[1] "Arhgap29" "Gm36888"  "Helz"     "Nrg1"     "Zfp369"   "Zmym6"

#Present in LAB but not in HOME
for (sample in 1:length(HVGs.HOME)) {
    print(setdiff(HVGs.LAB[[sample]], HVGs.HOME[[sample]]))
}

character(0)
character(0)
[1] "Cass4"   "Galnt17" "Grid2"  
character(0)
character(0)
character(0)
[1] "Aldh5a1" "Gm27162" "Gm36969" "Gm37452" "Ung"     "Wfdc18"


for (sample in 1:length(mean.list.LAB)) {
    print(sample)
    print(sum(mean.list.HOME[[sample]]))
    print(sum(mean.list.LAB[[sample]]))
}

[1] 1
[1] 302.9477
[1] 302.9477
[1] 2
[1] 373.7752
[1] 373.7752
[1] 3
[1] 203.9835
[1] 204.2463
[1] 4
[1] 270.4948
[1] 270.4948
[1] 5
[1] 154.8764
[1] 154.8764
[1] 6
[1] 247.2674
[1] 247.2674
[1] 7
[1] 196.836
[1] 198.1857

for (sample in 1:length(variance.list.LAB)) {
    print(sample)
    print(sum(variance.list.HOME[[sample]]))
    print(sum(variance.list.LAB[[sample]]))
}

[1] 1
[1] 25587.02
[1] 25587.02
[1] 2
[1] 30130.43
[1] 30130.43
[1] 3
[1] 20651.4
[1] 20649.48
[1] 4
[1] 25024.74
[1] 25024.74
[1] 5
[1] 19100.73
[1] 19100.73
[1] 6
[1] 18098.05
[1] 18098.05
[1] 7
[1] 16157.47
[1] 16170.22

I only plot UMAPs for sample from 1 to 4 and the screenshot in the main thread is coming from the sample 3...

I finished the analysis for both LAB and HOME objects after SCTransform on my office workstation.

For the sample 3, LAB object outputted Africa-shaped cluster in the good way and for the HOME object I got Africa-shaped cluster in reverse

How can the SCTransform function choose different HVGs from the exact same input objects, parameters ?

ADD REPLY • link 3.6 years ago by Bastien Hervé 5.3k

0

Entering edit mode

How can the SCTransform function choose different HVGs from the exact same input objects, parameters ?

Hmm, good question. Are results the same if you run several times on the same workstation, maybe test on a single sample. Just mclapply it 10 times, and see whether the function itself is fully deterministic.

ADD REPLY • link 3.6 years ago by ATpoint 82k

0

Entering edit mode

Yeah the objects are the same size every time at the lab and every time the same size at home but objects size are different from the lab compare to the ones I get from home.

mclapply(1:10, function(i) {
    object.size(SCTransform(SCRNARR.LAB, seed.use=1447854, variable.features.n = 2000, vars.to.regress = "percent.mt", verbose = FALSE))
})

[[1]]
159113712 bytes

[[2]]
159113712 bytes

[[3]]
159113712 bytes

[[4]]
159113712 bytes

mclapply(1:10, function(i) {
    object.size(SCTransform(SCRNARR.HOME, seed.use=1447854, variable.features.n = 2000, vars.to.regress = "percent.mt", verbose = FALSE))
})

[[1]]
159113040 bytes

[[2]]
159113040 bytes

[[3]]
159113040 bytes

[[4]]
159113040 bytes

ADD REPLY • link 3.6 years ago by Bastien Hervé 5.3k

0

Entering edit mode

I would check content, not size. Size is almost never a good indicator for anything:

> object.size(x = c(1,2))
64 bytes
> object.size(x = c(7,3))
64 bytes

ADD REPLY • link 3.6 years ago by ATpoint 82k

0

Entering edit mode

apply10SCTransform.HOME = list()
apply10SCTransform.HOME <- mclapply(1:10, function(i) {
    SCTransform(SCRNARR.HOME, seed.use=1447854, variable.features.n = 2000, vars.to.regress = "percent.mt", verbose = FALSE)
})
for (n in 1:9){
    print(setdiff(rownames(apply10SCTransform.HOME[[n]]@assays$SCT@meta.features[apply10SCTransform.HOME[[n]]@assays$SCT@meta.features$sct.variable,]), rownames(apply10SCTransform.HOME[[n+1]]@assays$SCT@meta.features[apply10SCTransform.HOME[[n+1]]@assays$SCT@meta.features$sct.variable,])))
}
character(0)
character(0)
character(0)
character(0)
character(0)
character(0)
character(0)
character(0)
character(0)

ADD REPLY • link 3.6 years ago by Bastien Hervé 5.3k