Question

Tutorial:Single-cell RNA-seq: Annotation: Celltype auto annotation with SCSA

1

Entering edit mode

9 months ago

Julia Ma ▴ 120

Part-2 here: Single-cell RNA-seq: Annotation: Celltype auto annotation with SCSA-2

Part-3 here: Single-cell RNA-seq: Annotation: Celltype auto annotation with SCSA-3

Full article lifted from: https://omicverse.readthedocs.io/en/latest/Tutorials-single/t_cellanno/

Single-cell transcriptomics allows the analysis of thousands of cells in a single experiment and the identification of novel cell types, states and dynamics in a variety of tissues and organisms. Standard experimental protocols and analytical workflows have been developed to create single-cell transcriptomic maps from tissues.

This tutorial focuses on how to interpret this data to identify cell types, states, and other biologically relevant patterns with the goal of creating annotated cell maps.

Paper: SCSA: A Cell Type Annotation Tool for Single-Cell RNA-seq Data

Code: https://github.com/bioinfo-ibms-pumc/SCSA

Colab_Reproducibility: https://colab.research.google.com/drive/1BC6hPS0CyBhNu0BYk8evu57-ua1bAS0T?usp=sharing

Note

The annotation with SCSA can't be used in rare celltype annotations

enter image description here

import omicverse as ov
print(f'omicverse version:{ov.__version__}')
import scanpy as sc
print(f'scanpy version:{sc.__version__}')
ov.ov_plot_set()

omicverse version:1.5.1
scanpy version:1.7.2

Loading data

The data consist of 3k PBMCs from a Healthy Donor and are freely available from 10x Genomics (here from this webpage). On a unix system, you can run the following to download and unpack the data. The last line creates a directory for writing processed data.

mkdir data
wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
mkdir write

Read in the count matrix into an AnnData object, which holds many slots for annotations and different representations of the data. It also comes with its own HDF5-based file format: .h5ad.

adata = sc.read_10x_mtx(
    'data/filtered_gene_bc_matrices/hg19/',  # the directory with the `.mtx` file
    var_names='gene_symbols',                # use gene symbols for the variable names (variables-axis index)
    cache=True)                              # write a cache file for faster subsequent reading

... reading from cache file cache/data-filtered_gene_bc_matrices-hg19-matrix.h5ad

Data preprocessing

Here, we use ov.single.scanpy_lazy to preprocess the raw data of scRNA-seq, it included filter the doublets cells, normalizing counts per cell, log1p, extracting highly variable genes, and cluster of cells calculation.

But if you want to experience step-by-step preprocessing, we also provide more detailed preprocessing steps here, please refer to our preprocess chapter for a detailed explanation.

We stored the raw counts in count layers, and the raw data in adata.raw.to_adata().

#adata=ov.single.scanpy_lazy(adata)

#quantity control
adata=ov.pp.qc(adata,
              tresh={'mito_perc': 0.05, 'nUMIs': 500, 'detected_genes': 250})
#normalize and high variable genes (HVGs) calculated
adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,)

#save the whole genes and filter the non-HVGs
adata.raw = adata
adata = adata[:, adata.var.highly_variable_features]

#scale the adata.X
ov.pp.scale(adata)

#Dimensionality Reduction
ov.pp.pca(adata,layer='scaled',n_pcs=50)

#Neighbourhood graph construction
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
               use_rep='scaled|original|X_pca')

#clusters
sc.tl.leiden(adata)

#Dimensionality Reduction for visualization(X_mde=X_umap+GPU)
adata.obsm["X_mde"] = ov.utils.mde(adata.obsm["scaled|original|X_pca"])
adata

Calculate QC metrics
End calculation of QC metrics.
Original cell number: 2700
Begin of post doublets removal and QC plot
Running Scrublet
filtered out 19024 genes that are detected in less than 3 cells
normalizing counts per cell
    finished (0:00:00)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
normalizing counts per cell
    finished (0:00:00)
normalizing counts per cell
    finished (0:00:00)
Embedding transcriptomes using PCA...
Automatically set threshold at doublet score = 0.31
Detected doublet rate = 1.4%
Estimated detectable doublet fraction = 35.1%
Overall doublet rate:
    Expected   = 5.0%
    Estimated  = 4.0%
    Scrublet finished (0:00:02)
Cells retained after scrublet: 2662, 38 removed.
End of post doublets removal and QC plots.
Filters application (seurat or mads)
Lower treshold, nUMIs: 500; filtered-out-cells: 0
Lower treshold, n genes: 250; filtered-out-cells: 3
Lower treshold, mito %: 0.05; filtered-out-cells: 56
Filters applicated.
Total cell filtered out with this last --mode seurat QC (and its chosen options): 59
Cells retained after scrublet and seurat filtering: 2603, 97 removed.
filtered out 19107 genes that are detected in less than 3 cells
Begin robust gene identification
After filtration, 13631/13631 genes are kept. Among 13631 genes, 13631 genes are robust.
End of robust gene identification.
Begin size normalization: shiftlog and HVGs selection pearson
normalizing counts per cell The following highly-expressed genes are not considered during normalization factor computation:
[]
    finished (0:00:00)
extracting highly variable genes
--> added
    'highly_variable', boolean vector (adata.var)
    'highly_variable_rank', float vector (adata.var)
    'highly_variable_nbatches', int vector (adata.var)
    'highly_variable_intersection', boolean vector (adata.var)
    'means', float vector (adata.var)
    'variances', float vector (adata.var)
    'residual_variances', float vector (adata.var)
End of size normalization: shiftlog and HVGs selection pearson
... as `zero_center=True`, sparse input is densified and may lead to large memory consumption
computing neighbors

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:01)
running Leiden clustering
    finished: found 11 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)

AnnData object with n_obs × n_vars = 2603 × 2000
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'leiden'
    var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'scrublet', 'log1p', 'hvg', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues', 'neighbors', 'leiden'
    obsm: 'scaled|original|X_pca', 'X_mde'
    varm: 'scaled|original|pca_loadings'
    layers: 'counts', 'scaled', 'lognorm'
    obsp: 'distances', 'connectivities'

Cell annotate automatically

We create a pySCSA object from the adata, and we need to set some parameter to annotate correctly.

In normal annotate, we set celltype='normal' and target='cellmarker' or 'panglaodb' to perform the cell annotate.

But in cancer annotate, we need to set the celltype='cancer' and target='cancersea' to perform the cell annotate.

Note

The annotation with SCSA need to download the database at first. It can be downloaded automatically. But sometimes you will have problems with network errors.

The database can be downloaded from figshare or Google Drive. And you need to set parameter model_path='path'

scsa=ov.single.pySCSA(adata=adata,
                      foldchange=1.5,
                      pvalue=0.01,
                      celltype='normal',
                      target='cellmarker',
                      tissue='All',
                      model_path='temp/pySCSA_2023_v2_plus.db'                    
)

In the previous cell clustering we used the leiden algorithm, so here we specify that the type is set to leiden. if you are using louvain, please change it. And, we will annotate all clusters, if you only want to annotate a few of the classes, please follow '3', '[1,2,3]', '[...]' Enter in the format.

rank_rep means the sc.tl.rank_genes_groups(adata, clustertype, method='wilcoxon'), if we provided the rank_genes_groups in adata.uns, rank_rep can be set as False

anno=scsa.cell_anno(clustertype='leiden',
               cluster='all',rank_rep=True)

ranking genes
    finished (0:00:01)
...Auto annotate cell
Version V2.1 [2023/06/27]
DB load: GO_items:47347,Human_GO:3,Mouse_GO:3,
CellMarkers:82887,CancerSEA:1574,PanglaoDB:24223
Ensembl_HGNC:61541,Ensembl_Mouse:55414
<omicverse.single._SCSA.Annotator object at 0x2c89d0220>
Version V2.1 [2023/06/27]
DB load: GO_items:47347,Human_GO:3,Mouse_GO:3,
CellMarkers:82887,CancerSEA:1574,PanglaoDB:24223
Ensembl_HGNC:61541,Ensembl_Mouse:55414
load markers: 70276
Cluster 0 Gene number: 75
Other Gene number: 1583
Cluster 1 Gene number: 144
Other Gene number: 1540
Cluster 10 Gene number: 123
Other Gene number: 1568
Cluster 2 Gene number: 515
Other Gene number: 1532
Cluster 3 Gene number: 129
Other Gene number: 1544
Cluster 4 Gene number: 82
Other Gene number: 1595
Cluster 5 Gene number: 931
Other Gene number: 1286
Cluster 6 Gene number: 243
Other Gene number: 1508
Cluster 7 Gene number: 5
Other Gene number: 1612
Cluster 8 Gene number: 67
Other Gene number: 1612
Cluster 9 Gene number: 587
Other Gene number: 1401
#Cluster Type Celltype Score Times
['0', '?', 'T cell|Naive CD8+ T cell', '8.725245970272397|5.338571100473261', 1.6343785267744977]
['1', 'Good', 'T cell', 12.961656351257975, 2.0373902200167624]
['10', 'Good', 'Megakaryocyte', 10.334739958943498, 2.0291785860088662]
['2', '?', 'Monocyte|Macrophage', '14.746713096982123|8.75641509612364', 1.6841039323855624]
['3', 'Good', 'B cell', 13.796626908885715, 4.003778697849947]
['4', '?', 'Natural killer cell|T cell', '8.660190824133249|8.051187833952175', 1.0756413839474572]
['5', '?', 'Monocyte|Macrophage', '13.971414830833528|9.949917068151382', 1.4041739981486407]
['6', 'Good', 'Natural killer cell', 15.59399546035418, 3.656175399110041]
['7', '?', 'T cell|CD8+ T cell', '5.397246597412726|4.148101678034799', 1.3011365237242019]
['8', 'Good', 'Monocyte', 10.95996648839431, 2.170437446718487]
['9', '?', 'Dendritic cell|Monocyte', '8.685332489648657|7.353020857529042', 1.1811924184541662]

We can query only the better annotated results

scsa.cell_auto_anno(adata,key='scsa_celltype_cellmarker')

...cell type added to scsa_celltype_cellmarker on obs of anndata

scRNA-seq SCSA • 456 views

ADD COMMENT • link updated 9 months ago by Ram 44k • written 9 months ago by Julia Ma ▴ 120