Question

Seurat3: RNA vs SCT assays for DotPlot


6.2 years ago

akh22 ▴ 120

Hi,

I am a bit confused about the use of the RNA vs. SCT assays for DGE analysis, and I am wondering whether anybody who uses Seurat could shed some light on it. I've been performing the Seurat3 integration method with SCTransform by simply following their vignette. According to some discussions and the vignette, the Seurat team indicated that the RNA assay (rather than the integrated or SCT assays) should be used for the DotPlot and FindMarkers functions, for comparing and exploring gene expression differences across cell types. But the RNA assay has raw count data, while the SCT assay has scaled and normalized data. It seems to me that the values in the SCT assay are more appropriate for comparing DGE among cell types. Am I missing something?

Thanks.

scRNAseq Seurat3 • 14k views

ADD COMMENT • link updated 5 months ago by txema.heredia ▴ 280 • written 6.2 years ago by akh22 ▴ 120


Apologies for resurrecting this old post.

When googling "Seurat FindAllMarkers SCTransform", I have seen it suggested on many forums that running Seurat's FindAllMarkers on the SCT or the RNA assay should give almost identical results.

a Seurat team indicated that the RNA assay (rather than integrated or SCT assays) should be used for DotPlot and FindMarkers functions

However, even in 2025, with Seurat v5.2.1, this original comment still holds true, at least in some cases.

I am analyzing a tissue containing a mixture of smooth muscle cells (SMCs), fibroblasts, endothelial cells, and immune cells (largely dominated by SMCs). SCTransform produces nice, consistent clusters for cell types and subtypes. Running FindAllMarkers at the cell-type level gives very consistent results (almost a 100% match) whether it is run on the SCT or the RNA assay.

However, differences arise when the data are subset to a specific cell type. In our case, subsetting to immune cells only and then running FindAllMarkers at the subtype level produced huge differences between SCT and RNA. After some exploration of the data, it turned out that one of our subtypes (T-cells cluster #2) had a significantly lower nFeature_RNA in the RNA assay than the other immune subtypes. This caused SCTransform to "fill up" this data with a lot of "imputed/inferred/fake" gene counts, and I suspect these counts are taken from the most abundant cell type, SMCs.

So, when FindAllMarkers is run on the SCT assay of the subsetted immune data, "Acta2" (an SMC marker) appears as the fourth most significant marker of T-cells cluster #2, with p_val_adj = 0, pct.1 = 0.771, and pct.2 = 0.266. This gene (and many more SMC-associated genes) is nowhere to be found when running FindAllMarkers on the RNA assay.

In the case of Acta2, it has counts > 0 in 397/515 (77.09%) cells of this cluster in the SCT assay. However, it is actually present in only 3/515 (0.58%) cells in the RNA assay. In this cell type, almost all of this gene's signal in the SCT assay is made up of artificial counts.
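The comparison above (the fraction of cells with a nonzero count for one gene, in each assay) is easy to reproduce as a sanity check once the count matrices have been exported. The following is only a minimal NumPy sketch, not Seurat code: the function name `detection_rate` and the toy matrices are hypothetical, and it assumes genes-by-cells arrays.

```python
import numpy as np

def detection_rate(counts, gene_index, cell_mask):
    """Fraction of the selected cells with a count > 0 for one gene.

    counts:     2-D array, genes x cells (assumed export of an assay's counts).
    gene_index: row index of the gene of interest.
    cell_mask:  boolean array marking the cells of the cluster of interest.
    """
    row = np.asarray(counts)[gene_index, cell_mask]
    return np.count_nonzero(row > 0) / row.size

# Toy data (made up): one gene measured in 10 cells of one cluster.
raw = np.array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])        # stand-in for RNA counts
corrected = np.array([[1, 1, 0, 1, 1, 1, 0, 1, 1, 1]])  # stand-in for SCT counts
mask = np.ones(10, dtype=bool)

raw_rate = detection_rate(raw, 0, mask)
sct_rate = detection_rate(corrected, 0, mask)
print(f"RNA: {raw_rate:.2%}  SCT: {sct_rate:.2%}")
```

A large gap between the two rates for a top marker is a hint that the marker is driven by corrected counts rather than by observed reads.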

In fact, of the top 20 markers found with the SCT assay, only 3 are considered markers at all when FindAllMarkers is run on the RNA assay (looking at all significant RNA markers, not only the top 20).
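For anyone who wants to run the same overlap check on their own results, here is a small Python sketch. It assumes the two marker lists have already been exported as plain lists of gene names; `marker_overlap` and the placeholder names `GeneA` to `GeneD` are hypothetical.

```python
def marker_overlap(top_sct, significant_rna):
    """Return the top SCT markers that are also significant RNA markers, in SCT order."""
    rna = set(significant_rna)
    return [gene for gene in top_sct if gene in rna]

# Placeholder gene lists, for illustration only.
top_sct = ["Acta2", "GeneA", "GeneB", "GeneC"]
significant_rna = ["GeneB", "GeneD", "GeneA"]

shared = marker_overlap(top_sct, significant_rna)
print(f"{len(shared)}/{len(top_sct)} top SCT markers confirmed in RNA: {shared}")
```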

Let this be a cautionary tale for future people searching about this issue.

ADD REPLY • link 5 months ago by txema.heredia ▴ 280

score 7 · Answer 1 · 2019-08-26

You can also normalize and scale data for the RNA assay. There are numerous resources on this, but Aaron Lun describes why the original log-normalized values should be used for DE and visualizations of expression quite well here:

For gene-based procedures like differential expression (DE) analyses or gene network construction, it is desirable to use the original log-expression values or counts. The corrected values are only used to obtain cell-level results such as clusters or trajectories. Batch effects are handled explicitly using blocking terms or via a meta-analysis across batches. We do not use the corrected values directly in gene-based analyses, for various reasons:

It is usually inappropriate to perform DE analyses on batch-corrected values, due to the failure to model the uncertainty of the correction. This usually results in loss of type I error control, i.e., more false positives than expected.

The correction does not preserve the mean-variance relationship. Applications of common DE methods like edgeR or limma are unlikely to be valid.

Batch correction may (correctly) remove biological differences between batches in the course of mapping all cells onto a common coordinate system. Returning to the uncorrected expression values provides an opportunity for detecting such differences if they are of interest. Conversely, if the batch correction made a mistake, the use of the uncorrected expression values provides an important sanity check.

In addition, the normalized values in the SCT and integrated assays don't necessarily correspond to per-gene expression values anyway; in the case of the scale.data slot of each, they are residuals.
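To make the difference between log-normalized values and residuals concrete, here is a minimal NumPy sketch (an illustration, not Seurat or sctransform code). `log_normalize` follows the log-normalization scheme as I understand Seurat's default to be (counts divided by the cell total, multiplied by a scale factor of 10,000, then log1p). `nb_pearson_residuals` uses the standard negative binomial Pearson residual formula; the `mu` and `theta` values below are made up, whereas in sctransform they would come from a regularized regression model.

```python
import numpy as np

def log_normalize(counts, scale_factor=10_000):
    """Library-size normalization per cell followed by log1p (genes x cells input)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0, keepdims=True)
    totals[totals == 0] = 1.0  # guard against division by zero for empty cells
    return np.log1p(counts / totals * scale_factor)

def nb_pearson_residuals(counts, mu, theta):
    """Pearson residuals under a negative binomial model: (x - mu) / sqrt(mu + mu^2 / theta)."""
    counts = np.asarray(counts, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return (counts - mu) / np.sqrt(mu + mu**2 / theta)

# Toy raw counts: 3 genes x 2 cells.
raw = np.array([[10, 0],
                [90, 50],
                [0, 50]])
norm = log_normalize(raw)

# Toy residuals for one gene across 3 cells, with an assumed expected count of 2.
gene_counts = np.array([0, 1, 5])
res = nb_pearson_residuals(gene_counts, mu=np.full(3, 2.0), theta=4.0)

print(norm.round(3))  # non-negative, and zero wherever the raw count is zero
print(res.round(3))   # can be negative: deviations from a model, not expression levels
```

The log-normalized values stay at zero wherever no reads were observed, while residuals are deviations from a fitted expectation and can be negative, which is one reason they are awkward to read as expression levels in a DotPlot or a DE test.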