Question: QC metrics for each dataset in Seurat for scRNA-seq
gravatar for mvis1231
12 weeks ago by
mvis123120 wrote:

Hi, I am new to bioinformatics and I have a few silly questions about how to define thresholds for QC metrics in Seurat.

I noticed that Seurat filters cells that have unique feature counts over 2,500 or less than 200. And I have heard that we need to apply a different QC values for each dataset, by looking at the distribution of the number of genes and the distribution of UMIs.

Then, I guess, applying a different threshold for the maximum number of genes to each data means that "the high gene count" is defined relatively for each data. For example, nFeature_RNA < 2000 is a threshold for the max value in A dataset, but nFeature_RNA < 3000 is a threshold for the max value in B dataset (i.e 2000 is not considered as high in B data).

But, I don't understand why we need to set a different upper limit for the number of genes in each data. Is something biologically wrong with the high gene count?

As a non-biology major, I don't understand why we change the upper limit for nFeature (genes) and nCount (UMIs) for each data.

I know this is pretty silly.. but if anyone can explain about the biological concept behind the numbers, I would really appreciate it. Thank you very much.

seurat qc scrna-seq • 264 views
ADD COMMENTlink modified 12 weeks ago by jared.andrews077.1k • written 12 weeks ago by mvis123120
gravatar for jared.andrews07
12 weeks ago by
Memphis, TN
jared.andrews077.1k wrote:

The reason we set an upper limit is because "cells" with abnormally high numbers of genes detected are often actually 2 or more cells stuck together, often of different cell types. Sequencing protocols aren't perfect, nor are automated fluid handlers, so sometimes these doublets (or multiplets) get through. This results in a doublet (or multiplet) rather than a single cell. And this is single cell sequencing, not multiplet sequencing! Those doublet and multiplets confound the clustering and can form small clusters that are confusing to interpret. So people try to remove them, usually by eyeballing the distribution and arbitrarily picking a cutoff. They do the same for low counts, as those typically represent dying cells or empty droplets.

As you've noted, this generally works, but it varies between datasets. It also runs the risk of you accidentally removing legitimate single cell populations that for whatever reason just have slightly higher (or lower) numbers of genes expressed. This is also true even if you don't eyeball it and use a more objective definition of your thresholds, like removing all cells >3 MADs (median absolute deviations) from the median of the total population. This is excellently presented in the pipeComp preprint - Figures 2 & 3.

So in short, using a blanket cutoff across all cells is a rather heavy handed approach. A safer alternative that will help preserve potentially interesting populations is to apply those distribution based thresholds on a cluster by cluster basis, and then check the remaining clusters to ensure none are composed completely of obvious doublets/multiplets. Or you can use an automated approach like the scDblFinder or DoubletFinder packages that attempt to simulate lots of doublets (or even triplets in scDblFinder) from the data and then compare each cell for similarity to the simulations to identify those that may be multiplets. They generally work fairly well, though you should always manually inspect your results after each step to ensure what's being done makes sense.

ADD COMMENTlink modified 12 weeks ago • written 12 weeks ago by jared.andrews077.1k

Wow.. This explanation is just amazing... I clearly understand now.. Thank you so much for the details, Jared.

ADD REPLYlink written 12 weeks ago by mvis123120

You're quite welcome. And this question isn't silly at all, particularly given your background. You'll find well-written, clear questions like yours here will often be answered quickly and thoroughly. And now others with the same question will easily be able to find an explanation - this is the entire point of Biostars.

As an aside, you may also find the Orchestrating Single Cell Analysis book by some of the Bioconductor folks very helpful, even if you don't use any Bioconductor packages. It has some very useful contextual info that helps explain many analysis steps like this. It's well-written and easy to understand with many code examples that you can run directly if you'd like.

ADD REPLYlink written 12 weeks ago by jared.andrews077.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1672 users visited in the last hour