Question

How to interpret ScanPy scatter plots for QC filtering?

0

2.9 years ago

Pratik ★ 1.0k

Hey y'all,

I'm working on a scRNA-seq project using publicly available data in ScanPy. I am stuck on, I guess, a QC step of filtering out cells. These scatter plots were generated.

[Image: scatter plot]

I'm having trouble interpreting why there are two bunches of cells in the bottom graph, especially the bottom bunch with low n_genes_by_counts and higher total_counts. Does anyone have an idea what they could be, or how to look into them further?

Could someone explain how to interpret these graphs, please?

Python jupyter-lab ScanPy RNA-Seq scRNA-seq • 2.2k views

1

Cells with many counts but very few genes are maybe damaged cells with poor capture of transcripts. Can you check whether ribosomal genes are what separates the group at the bottom of plot 2?
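A rough sketch of one way to run that check. It assumes `adata.obs` already contains `n_genes_by_counts`, `total_counts` and `pct_counts_ribo` (for example from `sc.pp.calculate_qc_metrics` with a ribosomal gene flag passed in `qc_vars`). The function name `compare_low_complexity_cells` and the `ratio_cutoff` value are made up for illustration; the cutoff would need to be read off the actual scatter plot.

```python
import pandas as pd

def compare_low_complexity_cells(obs: pd.DataFrame, ratio_cutoff: float = 0.2) -> pd.DataFrame:
    """Summarise pct_counts_ribo for cells with few genes per count vs. all other cells."""
    # Genes detected per count: low values mean many counts spread over few genes.
    ratio = obs["n_genes_by_counts"] / obs["total_counts"]
    # Label each cell by which side of the (hypothetical) cutoff it falls on.
    group = ratio.lt(ratio_cutoff).map({True: "low_complexity", False: "other"})
    return obs.groupby(group)["pct_counts_ribo"].agg(["count", "median", "mean"])

# Toy stand-in for adata.obs (values are invented, not from the dataset in question).
toy_obs = pd.DataFrame({
    "n_genes_by_counts": [2000, 2500, 300, 250],
    "total_counts": [6000, 8000, 7000, 6500],
    "pct_counts_ribo": [12.0, 15.0, 55.0, 60.0],
})
summary = compare_low_complexity_cells(toy_obs)
print(summary)
```

On real data this would be called as `compare_low_complexity_cells(adata.obs)`. A much higher ribosomal percentage in the low-complexity group would support the idea that ribosomal genes drive the separation.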

2.9 years ago by ATpoint 81k

0

Thank you ATpoint

This really helped me learn more about ScanPy.

This tutorial helped me too:

https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/scanpy/scanpy_01_qc.html

So this is the percentage of counts for ribosomal genes and hemoglobin genes:

[Image: plots of the percentage of counts from ribosomal and hemoglobin genes]

From your experience, where would you make the cut off for this dataset? It's a human fetal pancreas dataset.

I did the cut-off like so:

# keep cells with fewer than 4000 detected genes
adata = adata[adata.obs.n_genes_by_counts < 4000, :]
# keep cells with less than 20% mitochondrial counts
adata = adata[adata.obs['pct_counts_mt'] < 20, :]
# keep cells with less than 10% ribosomal counts
adata = adata[adata.obs['pct_counts_ribo'] < 10, :]
# keep cells with less than 20% hemoglobin counts
adata = adata[adata.obs['pct_counts_hb'] < 20, :]

print("Remaining cells %d"%adata.n_obs)

Remaining cells 156

But I went from an original ~9000 cells to 156 cells! I guess there was actually this much damage?
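To see which of the cutoffs is responsible for most of the loss, it can help to count how many cells pass each filter on its own before applying them all. A minimal sketch, assuming the QC columns used above exist in `adata.obs`; the helper name `cells_passing_each_filter` and the toy values are made up for illustration:

```python
import pandas as pd

def cells_passing_each_filter(obs: pd.DataFrame, upper_limits: dict) -> pd.Series:
    """Count cells below each upper limit separately, and below all limits at once."""
    passing = {
        metric: int((obs[metric] < limit).sum())
        for metric, limit in upper_limits.items()
    }
    # Combine the individual masks to get the cells that survive every filter.
    all_mask = pd.Series(True, index=obs.index)
    for metric, limit in upper_limits.items():
        all_mask &= obs[metric] < limit
    passing["all_filters"] = int(all_mask.sum())
    return pd.Series(passing, name="cells_passing")

# Same cutoffs as in the filtering code above.
limits = {
    "n_genes_by_counts": 4000,
    "pct_counts_mt": 20,
    "pct_counts_ribo": 10,
    "pct_counts_hb": 20,
}

# Toy stand-in for adata.obs (invented values).
toy_obs = pd.DataFrame({
    "n_genes_by_counts": [1500, 3000, 4500, 2000, 2500],
    "pct_counts_mt": [5.0, 8.0, 3.0, 25.0, 6.0],
    "pct_counts_ribo": [8.0, 22.0, 30.0, 18.0, 25.0],
    "pct_counts_hb": [0.1, 0.5, 0.2, 0.0, 30.0],
})
report = cells_passing_each_filter(toy_obs, limits)
print(report)
```

On the real object this would be `cells_passing_each_filter(adata.obs, limits)`, run before any filtering. If one metric passes far fewer cells than the others, that cutoff is the one to reconsider.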

2.9 years ago by Pratik ★ 1.0k

score 3 · Accepted Answer · 2021-05-24

Filtering for ribosomal read percentage is relatively uncommon and not a particularly good idea, imo, given that those genes can vary widely depending on cell state (e.g. if cells are proliferating heavily). I very much doubt that large a proportion of your cells are damaged/low quality given the mitochondrial read percentages and number of genes/reads per cell. At a glance, this looks like good quality data.

I have an answer to another question that may be a helpful read for you as well. In short, using arbitrary cutoffs can have some unwanted side-effects, and there are a few more nuanced approaches that may work better. The OSCA book also has a great QC chapter that will be a good read even if you aren't using Bioconductor packages.
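As an illustration of what a less arbitrary cutoff can look like, here is a sketch of outlier detection based on the median absolute deviation (MAD), in the spirit of what the OSCA QC chapter describes. Whether this is one of the approaches meant above is an assumption. The function name `mad_outliers` and its parameters are hypothetical, and the 3-MAD default is a common convention rather than a recommendation for this dataset.

```python
import numpy as np

def mad_outliers(values, nmads: float = 3.0, side: str = "higher", log: bool = False) -> np.ndarray:
    """Flag values more than `nmads` scaled MADs from the median on the given side."""
    x = np.asarray(values, dtype=float)
    if log:
        # Log-transform heavy-tailed metrics such as total counts before testing.
        x = np.log1p(x)
    med = np.median(x)
    # 1.4826 scales the MAD to match the standard deviation for normal data.
    mad = 1.4826 * np.median(np.abs(x - med))
    if side == "higher":
        return x > med + nmads * mad
    if side == "lower":
        return x < med - nmads * mad
    raise ValueError("side must be 'higher' or 'lower'")

# Toy mitochondrial percentages (invented values) with one obvious outlier.
pct_mt = [2, 3, 4, 3, 5, 4, 3, 40]
flags = mad_outliers(pct_mt)
print(flags)
```

With an AnnData object, something like `adata = adata[~mad_outliers(adata.obs["pct_counts_mt"])].copy()` would drop the flagged cells. The threshold then adapts to the distribution in the dataset instead of being fixed in advance.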