QC metrics for each dataset in Seurat for scRNA-seq
3.8 years ago
mvis1231 ▴ 120

Hi, I am new to bioinformatics and I have a few silly questions about how to define thresholds for QC metrics in Seurat.

I noticed that Seurat's tutorial filters out cells that have unique feature counts over 2,500 or under 200. I have also heard that we need to apply different QC thresholds to each dataset, by looking at the distribution of the number of genes and the distribution of UMIs.
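For reference, this is the filtering step from the Seurat PBMC tutorial that I'm describing (the pbmc object name and the 5% mitochondrial cutoff are the tutorial's, not mine):

    library(Seurat)

    # Percent of counts coming from mitochondrial genes (human "MT-" prefix)
    pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

    # The tutorial's fixed cutoffs: 200 < genes detected < 2,500, < 5% mitochondrial
    pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)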

Then, I guess, applying a different threshold for the maximum number of genes to each dataset means that "a high gene count" is defined relative to each dataset. For example, nFeature_RNA < 2000 might be the upper threshold for dataset A, while nFeature_RNA < 3000 is the upper threshold for dataset B (i.e., 2000 is not considered high in dataset B).

But I don't understand why we need to set a different upper limit on the number of genes for each dataset. Is something biologically wrong with a high gene count?

As a non-biology major, I don't understand why we change the upper limits for nFeature (genes) and nCount (UMIs) from dataset to dataset.

I know this is pretty silly, but if anyone can explain the biological concept behind these numbers, I would really appreciate it. Thank you very much.

scRNA-seq Seurat QC • 9.3k views

The reason we set an upper limit is that "cells" with abnormally high numbers of detected genes are often actually two or more cells stuck together, frequently of different cell types. Sequencing protocols aren't perfect, nor are automated fluid handlers, so sometimes these doublets (or multiplets) get through. And this is single-cell sequencing, not multiplet sequencing! Doublets and multiplets confound the clustering and can form small clusters that are confusing to interpret. So people try to remove them, usually by eyeballing the distribution and arbitrarily picking a cutoff. They do the same for low counts, as those typically represent dying cells or empty droplets.
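To make "eyeballing the distribution" concrete, a minimal sketch in Seurat looks something like this (assuming your object is named seu and the usual QC metrics have been computed; the cutoffs shown are placeholders, not recommendations):

    library(Seurat)

    # Violin plots of the per-cell QC metrics; look for a long upper tail
    # (potential multiplets) and a low-count shoulder (dying cells/empty droplets)
    VlnPlot(seu, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

    # Then pick cutoffs by eye, e.g.:
    seu <- subset(seu, subset = nFeature_RNA > 200 & nFeature_RNA < 5000)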

As you've noted, this generally works, but it varies between datasets. It also runs the risk of accidentally removing legitimate single-cell populations that, for whatever reason, just have slightly higher (or lower) numbers of genes expressed. This is true even if you don't eyeball it and instead use a more objective definition of your thresholds, like removing all cells more than 3 MADs (median absolute deviations) from the median of the total population. This is excellently presented in the pipeComp preprint (Figures 2 & 3).
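As a sketch of what a MAD-based threshold looks like in base R (assuming a Seurat object named seu; scater's isOutlier() wraps the same logic more conveniently):

    # Work on the log scale, since the raw count distributions are heavily skewed
    libsize <- log10(seu$nCount_RNA)

    # Keep cells within 3 median absolute deviations of the median
    keep <- abs(libsize - median(libsize)) <= 3 * mad(libsize)
    seu <- subset(seu, cells = colnames(seu)[keep])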

So in short, using a blanket cutoff across all cells is a rather heavy-handed approach. A safer alternative that helps preserve potentially interesting populations is to apply those distribution-based thresholds on a cluster-by-cluster basis, and then check the remaining clusters to ensure none are composed entirely of obvious doublets/multiplets. Alternatively, you can use an automated approach like the scDblFinder or DoubletFinder packages, which simulate lots of doublets (or even triplets, in scDblFinder's case) from the data and then compare each cell to the simulations to identify likely multiplets. They generally work fairly well, though you should always manually inspect your results after each step to ensure what's being done makes sense.
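If you go the automated route, a minimal scDblFinder run looks roughly like this (assuming a Seurat object named seu; see the package vignette for the real details and tuning options):

    library(Seurat)
    library(scDblFinder)

    # scDblFinder operates on SingleCellExperiment objects
    sce <- as.SingleCellExperiment(seu)
    sce <- scDblFinder(sce)

    # Copy the singlet/doublet calls back and drop the predicted doublets
    seu$scDblFinder.class <- sce$scDblFinder.class
    seu <- subset(seu, subset = scDblFinder.class == "singlet")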


Wow.. This explanation is just amazing... I clearly understand now.. Thank you so much for the details, Jared.


You're quite welcome. And this question isn't silly at all, particularly given your background. You'll find that well-written, clear questions like yours are often answered quickly and thoroughly here. And now others with the same question will easily be able to find an explanation - that's the entire point of Biostars.

As an aside, you may also find the Orchestrating Single Cell Analysis book by some of the Bioconductor folks very helpful, even if you don't use any Bioconductor packages. It has some very useful contextual info that helps explain many analysis steps like this. It's well-written and easy to understand with many code examples that you can run directly if you'd like.


beautifully explained


This was quite helpful. I was wondering, does 2,500 feature counts equate to the number of genes expressed by a cell? And if so, I thought the maximum number of genes that can be expressed by a cell was much higher? (This might be a dumb question.)


This isn't a dumb question, and it's actually pretty important to understand. nFeature does equate to the number of genes detected in a given cell. The number of genes actually expressed is likely much higher, but keep in mind the relatively low number of reads per cell in 10X experiments (50k is a common target). That means many lowly expressed transcripts never get sequenced at all, simply because of the stoichiometry involved. 10X scRNA-seq is quite sparse, and this is compensated for by the high number of cells profiled, the idea being that it's no big deal if you don't capture a gene in one cell, as you'll (hopefully) catch it in another. Many genes have only 1-2 reads mapped to them in a given cell. Ultimately, you compare clusters/populations of cells rather than individual cells to each other, so this is okay.
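To make that concrete, nFeature_RNA is just the number of genes with at least one count in a cell, and you can recompute it (along with the overall sparsity) directly from the count matrix. A sketch, assuming a Seurat object named seu and the v3/v4 GetAssayData interface (which takes a slot argument):

    library(Seurat)

    counts <- GetAssayData(seu, assay = "RNA", slot = "counts")

    # Genes detected per cell = nonzero entries per column;
    # this should match seu$nFeature_RNA exactly
    genes_detected <- Matrix::colSums(counts > 0)

    # Fraction of zero entries in the whole matrix (typically >90% for 10X data)
    1 - Matrix::nnzero(counts) / prod(dim(counts))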

Other single cell platforms (e.g. SmartSeq2) tend to capture the transcriptome of each cell more faithfully, but the tradeoff is that they profile far fewer cells (typically a few hundred).


thank you!

