Tutorial:Single-cell RNA-seq: Preprocessing: Data integration and batch correction-3
0
0
Entering edit mode
3 months ago
Julia Ma ▴ 120

Part-1 here: Single-cell RNA-seq: Preprocessing: Data integration and batch correction

Part-2 here: Single-cell RNA-seq: Preprocessing: Data integration and batch correction-3

Full article lifted from: https://omicverse.readthedocs.io/en/latest/Tutorials-single/t_single_batch/

scVI

An important task of single-cell analysis is the integration of several samples, which we can perform with scVI. For integration, scVI treats the data as unlabelled. When our dataset is fully labelled (perhaps in independent studies, or independent analysis pipelines), we can obtain an integration that better preserves biology using scANVI, which incorporates cell type annotation information. Here we demonstrate this functionality with an integrated analysis of cells from the lung atlas integration task from the scIB manuscript. The same pipeline would generally be used to analyze any collection of scRNA-seq datasets.

adata_scvi=ov.single.batch_correction(adata,batch_key='batch',
                           methods='scVI',n_layers=2, n_latent=30, gene_likelihood="nb")
adata

...Begin using scVI to correct batch effect

Global seed set to 0
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Epoch 300/300: 100%|| 300/300 [05:51<00:00,  1.18s/it, loss=1.51e+03, v_num=1]

`Trainer.fit` stopped: `max_epochs=300` reached.

Epoch 300/300: 100%|| 300/300 [05:51<00:00,  1.17s/it, loss=1.51e+03, v_num=1]

AnnData object with n_obs × n_vars = 26707 × 3000
    obs: 'GEX_n_genes_by_counts', 'GEX_pct_counts_mt', 'GEX_size_factors', 'GEX_phase', 'ADT_n_antibodies_by_counts', 'ADT_total_counts', 'ADT_iso_count', 'cell_type', 'batch', 'ADT_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker', 'is_train', 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'n_genes', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'topic_0', 'topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5', 'topic_6', 'topic_7', 'topic_8', 'topic_9', 'topic_10', 'topic_11', 'topic_12', 'topic_13', 'topic_14', 'LDA_cluster', '_scvi_batch', '_scvi_labels'
    var: 'feature_types', 'gene_id', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'batch_colors', 'cell_type_colors', 'hvg', 'layers_counts', 'log1p', 'scaled|original|cum_sum_eigenvalues', 'scaled|original|pca_var_ratios', 'scrublet', 'topic_dendogram', '_scvi_uuid', '_scvi_manager_uuid'
    obsm: 'ADT_X_pca', 'ADT_X_umap', 'ADT_isotype_controls', 'GEX_X_pca', 'GEX_X_umap', 'X_combat', 'X_harmony', 'X_mde_combat', 'X_mde_harmony', 'X_mde_mira', 'X_mde_mira_feature', 'X_mde_mira_topic', 'X_mde_pca', 'X_mde_scanorama', 'X_scanorama', 'X_topic_compositions', 'X_umap_features', 'scaled|original|X_pca', 'X_scVI'
    varm: 'scaled|original|pca_loadings', 'topic_feature_activations', 'topic_feature_compositions'
    layers: 'counts', 'lognorm', 'scaled'

adata.obsm["X_mde_scVI"] = ov.utils.mde(adata.obsm["X_scVI"])

ov.utils.embedding(adata,
                basis='X_mde_scVI',frameon='small',
                color=['batch','cell_type'],show=False)

[<AxesSubplot: title={'center': 'batch'}, xlabel='X_mde_scVI1', ylabel='X_mde_scVI2'>,
 <AxesSubplot: title={'center': 'cell_type'}, xlabel='X_mde_scVI1', ylabel='X_mde_scVI2'>]

enter image description here

MIRA+CODAL

Topic modeling of batched single-cell data is challenging because these models cannot typically distinguish between biological and technical effects of the assay. CODAL (COvariate Disentangling Augmented Loss) uses a novel mutual information regularization technique to explicitly disentangle these two sources of variation.

LDA_obj=ov.utils.LDA_topic(adata,feature_type='expression',
                  highly_variable_key='highly_variable_features',
                 layers='counts',batch_key='batch',learning_rate=1e-3)

INFO:mira.adata_interface.topic_model:Predicting expression from genes from col: highly_variable_features

mira have been install version: 2.1.0

Gathering dataset statistics:   0%|          | 0/26707 [00:00<?, ?it/s]

Learning rate range test:   0%|          | 0/98 [00:00<?, ?it/s]

INFO:mira.topic_model.base:Set learning rates to: (0.0061957597093704065, 0.22248668375233174)

enter image description here

LDA_obj.plot_topic_contributions(6)

Gathering dataset statistics:   0%|          | 0/26707 [00:00<?, ?it/s]

Epoch 0:   0%|          | 0/24 [00:00<?, ?it/s]

Predicting latent vars:   0%|          | 0/105 [00:00<?, ?it/s]

enter image description here

LDA_obj.predicted(15)

INFO:mira.adata_interface.topic_model:Predicting expression from genes from col: highly_variable_features

running LDA topic predicted

Gathering dataset statistics:   0%|          | 0/26707 [00:00<?, ?it/s]

Epoch 0:   0%|          | 0/24 [00:00<?, ?it/s]

INFO:mira.topic_model.base:Moving model to device: cpu

Predicting latent vars:   0%|          | 0/105 [00:00<?, ?it/s]

INFO:mira.adata_interface.core:Added key to obsm: X_topic_compositions
INFO:mira.adata_interface.core:Added key to obsm: X_umap_features
INFO:mira.adata_interface.topic_model:Added cols: topic_0, topic_1, topic_2, topic_3, topic_4, topic_5, topic_6, topic_7, topic_8, topic_9, topic_10, topic_11, topic_12, topic_13, topic_14
INFO:mira.adata_interface.core:Added key to varm: topic_feature_compositions
INFO:mira.adata_interface.core:Added key to varm: topic_feature_activations
INFO:mira.adata_interface.topic_model:Added key to uns: topic_dendogram

finished: found 15 clusters and added
    'LDA_cluster', the cluster labels (adata.obs, categorical)

adata.obsm["X_mde_mira_topic"] = ov.utils.mde(adata.obsm["X_topic_compositions"])
adata.obsm["X_mde_mira_feature"] = ov.utils.mde(adata.obsm["X_umap_features"])

ov.utils.embedding(adata,
                basis='X_mde_mira_topic',frameon='small',
                color=['batch','cell_type'],show=False)

[<AxesSubplot: title={'center': 'batch'}, xlabel='X_mde_mira_topic1', ylabel='X_mde_mira_topic2'>,
 <AxesSubplot: title={'center': 'cell_type'}, xlabel='X_mde_mira_topic1', ylabel='X_mde_mira_topic2'>]

enter image description here

ov.utils.embedding(adata,
                basis='X_mde_mira_feature',frameon='small',
                color=['batch','cell_type'],show=False)



[<AxesSubplot: title={'center': 'batch'}, xlabel='X_mde_mira_feature1', ylabel='X_mde_mira_feature2'>,
 <AxesSubplot: title={'center': 'cell_type'}, xlabel='X_mde_mira_feature1', ylabel='X_mde_mira_feature2'>]

enter image description here

Benchmarking test

The methods demonstrated here are selected based on results from benchmarking experiments including the single-cell integration benchmarking project [Luecken et al., 2021]. This project also produced a software package called scib that can be used to run a range of integration methods as well as the metrics that were used for evaluation. In this section, we show how to use this package to evaluate the quality of an integration.

adata.write_h5ad('neurips2021_batch_all.h5ad',compression='gzip')

adata=sc.read('neurips2021_batch_all.h5ad')

adata.obsm['X_pca']=adata.obsm['scaled|original|X_pca'].copy()
adata.obsm['X_mira_topic']=adata.obsm['X_topic_compositions'].copy()
adata.obsm['X_mira_feature']=adata.obsm['X_umap_features'].copy()

from scib_metrics.benchmark import Benchmarker
bm = Benchmarker(
    adata,
    batch_key="batch",
    label_key="cell_type",
    embedding_obsm_keys=["X_pca", "X_combat", "X_harmony",
                         'X_scanorama','X_mira_topic','X_mira_feature','X_scVI'],
    n_jobs=8,
)
bm.benchmark()

bm.plot_results_table(min_max_scale=False)

enter image description here

<plottable.table.Table at 0x7f0d9a429d60>

We can find that harmony removes the batch effect the best of the three methods that do not use the GPU, scVI is method to remove batch effect using GPU.

scRNA-seq • 273 views
ADD COMMENT

Login before adding your answer.

Traffic: 1952 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6