Question

Does SCVI automatically use highly variable genes?

3.2 years ago

pomodoro_sinensis ▴ 110

According to the SCVI tutorials, it is recommended to pre-select highly variable genes before training the SCVI model. Here is a snippet from the harmonization tutorial: https://docs.scvi-tools.org/en/stable/user_guide/notebooks/harmonization.html

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata  # keep full dimension safe
sc.pp.highly_variable_genes(
    adata,
    flavor="seurat_v3",
    n_top_genes=2000,
    layer="counts",
    batch_key="batch",
    subset=True,
)

What confuses me is that they set subset=True, which I understood to mean that the non-variable genes are not filtered out and the highly variable ones are only marked. Then they train the SCVI model:

scvi.data.setup_anndata(adata, layer="counts", batch_key="batch")
vae = scvi.model.SCVI(adata)
vae.train()

How does SCVI know which genes are highly variable and which are not? Is it because of the counts layer? Does anybody know whether this is because the counts layer only contains the highly variable genes, or because the layer marks the highly variable genes in a way SCVI understands?

scRNA-seq SCVI Highly variable genes • 1.6k views

updated 2.5 years ago by valehvpa ▴ 10 • written 3.2 years ago by pomodoro_sinensis ▴ 110

score 1 · Answer 1 · 2021-10-10

Hi, and thanks for reaching out. I am a member of the scvi-tools team and can offer some help. Please also feel free to reach out to us on Discourse.

Setting subset=True indicates that we do want to filter down to the highly variable genes. Scanpy updates adata in place so that it contains only the highly variable genes (code reference). The same adata object is then used for the later steps you mentioned, such as training, so SCVI only ever sees the highly variable genes.