I'm integrating some in-house single-cell count data with public data pulled from GEO/Zenodo. There is a substantial batch effect: the in-house data and two of the public studies are 10x Chromium, while the other three studies are combinatorial-indexing (split-and-pool) data.
Running Harmony as a baseline does clean up quite a bit of the batch effect, but some still remains.
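(For reference, the Harmony baseline was nothing fancier than the standard scanpy wrapper on the PCA space; a minimal sketch, assuming the same study_id key:)
# Harmony baseline (sketch): adjusts X_pca and writes X_pca_harmony
import scanpy as sc
sc.external.pp.harmony_integrate(adata, key='study_id')
sc.pp.neighbors(adata, use_rep='X_pca_harmony')
sc.tl.umap(adata)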
I have previously had good experience with scVI for dealing with batch effects, and SysVI appears to be its recommended successor for substantial batch effects. However, I can't seem to run it correctly and always get wonky results. My pipeline:
# basic run-through
import scanpy as sc
import scvi

# keep raw counts aside before normalizing (used below for SysVI)
adata.layers['counts'] = adata.X.copy()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=4500, batch_key='study_id')
sc.tl.pca(adata)
sc.pp.neighbors(adata, use_rep='X_pca', n_pcs=15)
sc.tl.umap(adata)
adata.obsm['X_umap_pca'] = adata.obsm['X_umap']
# [... initial plotting omitted...]
# subset to genes that are highly variable in at least 2 studies, and reset X from log-normalized data back to counts
asub = adata[:, adata.var.highly_variable_nbatches > 1].copy() # ~2800 total
asub.X = asub.layers['counts'].copy()
scvi.external.SysVI.setup_anndata(
    adata=asub,
    batch_key='study_id',
    categorical_covariate_keys=['target', 'group'],
)
scvi.settings.dl_num_workers = 16
model = scvi.external.SysVI(
    adata=asub,
    embed_categorical_covariates=False,
    n_prior_components=7,
    n_layers=3,
)
max_epochs = 200
model.train(
    max_epochs=max_epochs,
    check_val_every_n_epoch=1,
    plan_kwargs={
        "kl_weight": 0.5,
        "z_distance_cycle_weight": 30,
    },
)
The cycle loss looks totally wonky.
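(The curves I'm referring to come from model.history, plotted roughly like this; the exact metric names depend on the scvi-tools version, so I just loop over whatever keys exist:)
import matplotlib.pyplot as plt

# model.history is a dict of per-metric DataFrames indexed by epoch
for name, df in model.history.items():
    fig, ax = plt.subplots(figsize=(4, 2.5))
    df.plot(ax=ax, legend=False)
    ax.set_title(name)
    ax.set_xlabel('epoch')
    fig.tight_layout()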
I've repeated this with several different plan_kwargs settings (including the standard kl_weight=1, z_distance_cycle_weight=2), and the integration always looks essentially arbitrary.
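(The integration I'm judging is the UMAP of the SysVI latent space, extracted the standard way; roughly:)
# extract the SysVI latent space and recompute neighbors/UMAP on it
asub.obsm['X_sysvi'] = model.get_latent_representation()
sc.pp.neighbors(asub, use_rep='X_sysvi')
sc.tl.umap(asub)
sc.pl.umap(asub, color=['study_id', 'target', 'group'])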
I feel like I'm missing something fundamental. What is the correct way to run SysVI? Or should I just fall back to scVI?
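If it matters, the scVI fallback I have in mind is just the vanilla model on the same subsetted object (a rough, untuned sketch):
# plain scVI on the same subset, for comparison
scvi.model.SCVI.setup_anndata(
    asub,
    layer='counts',
    batch_key='study_id',
    categorical_covariate_keys=['target', 'group'],
)
scvi_model = scvi.model.SCVI(asub, n_layers=2)
scvi_model.train(max_epochs=max_epochs)
asub.obsm['X_scvi'] = scvi_model.get_latent_representation()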