Question

Subsetting before QC in Spatial Transcriptomics

1

9 weeks ago

Kent ▴ 30

Dear community,

I am working with spatial transcriptomics data and focusing on a rare cell population that makes up only a small fraction of the tissue. Standard quality control filters (e.g., removing low-quality cells or applying minimum gene count thresholds, as described in common tutorials) may eliminate these cells before downstream analysis.

To address this, I am considering first subsetting cells based on canonical marker genes, and then performing QC, normalization, and clustering only within that subset.

Is this a recommended and feasible approach? What potential pitfalls should I be aware of, and are there better practices for retaining rare populations without introducing bias?

I would greatly appreciate any comments or suggestions. Thank you in advance!

spatial-transcriptomics rare-cell-types quality-control • 7.4k views

19 minutes ago by Kent ▴ 30

1

Yes, absolutely. In my view it is even good practice to do an (automated) crude cell type assignment first, because QC metrics such as the number of detected genes can vary wildly between cell types. These initial QC steps are only a very crude prefilter to remove the obvious trash, so if you get a crude cell type separation first and then remove the worst cells per cell type, I don't see how this would introduce bias.
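To make the per-cell-type idea concrete, here is a minimal sketch in plain Python. It assumes you already have a per-cell count of detected genes and a crude cell type label for each cell; the function names, the 500-gene global cutoff, the 3-MAD rule, and the toy numbers are all illustrative assumptions, not part of any library or recommended defaults.

```python
from statistics import median

def mad_lower_bound(values, n_mads=3.0):
    """Lower cutoff = median - n_mads * MAD (raw, unscaled median absolute deviation)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med - n_mads * mad

def per_type_keep_mask(n_genes, cell_types, n_mads=3.0):
    """Keep a cell unless it is a low outlier within its own crude cell type."""
    by_type = {}
    for g, t in zip(n_genes, cell_types):
        by_type.setdefault(t, []).append(g)
    bounds = {t: mad_lower_bound(v, n_mads) for t, v in by_type.items()}
    return [g >= bounds[t] for g, t in zip(n_genes, cell_types)]

# Toy, made-up numbers: the "rare" type has far fewer detected genes per cell.
n_genes = [2000, 2100, 1900, 2050, 1950, 150, 400, 420, 380, 410, 60]
cell_types = ["abundant"] * 6 + ["rare"] * 5

# A single global cutoff (hypothetical value) removes every rare cell.
global_keep = [g >= 500 for g in n_genes]

# Per-type cutoffs remove only the outlier within each type.
per_type_keep = per_type_keep_mask(n_genes, cell_types)

rare_global = sum(k for k, t in zip(global_keep, cell_types) if t == "rare")
rare_per_type = sum(k for k, t in zip(per_type_keep, cell_types) if t == "rare")
print("rare cells kept, global cutoff:", rare_global)
print("rare cells kept, per-type cutoff:", rare_per_type)
```

With real data, the per-cell metric would typically come from your toolkit's QC step (for example `sc.pp.calculate_qc_metrics` in Scanpy), and the crude labels from whatever automated annotation you trust.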

9 weeks ago by ATpoint 90k

0

Hi ATpoint, thank you very much for the helpful insight! I'm still doing some literature reading on this myself, but have you come across any papers or workflows that address this same issue or that apply a similar workflow?

9 weeks ago by Kent ▴ 30

score 1 · Answer 1 · 2025-11-18

Yes, subsetting cells based on canonical marker genes before applying quality control, normalization, and clustering is a feasible approach, but it is not universally recommended as a first-line strategy. This method can help retain rare populations that might otherwise be filtered out by global thresholds, such as minimum gene counts or unique molecular identifier (UMI) totals, which implicitly assume that all cell types have similar RNA content and complexity. In spatial transcriptomics, where a spot may capture multiple cells or only low-abundance transcripts, per-population thresholds are often a better fit for heterogeneous tissue than a single global cutoff.

However, potential pitfalls include introducing selection bias by relying on predefined markers, which may exclude subpopulations with variable or low marker expression. This can leave the rare population incompletely represented. It can also inflate false positives in downstream analyses such as differential expression or pathway enrichment, because the genes used to select the cells are then tested within the same cells. Additionally, performing quality control only on the subset risks overlooking technical artifacts, like ambient RNA contamination or spatial batch effects, that affect the entire dataset.

Better practices involve applying lenient global quality control first to minimize cell loss, followed by dimensionality reduction and clustering on the full dataset so that rare clusters emerge from the data rather than from a predefined marker list. Tools like Seurat or Scanpy support this kind of iterative filtering. For example, in Seurat, you can set loose thresholds through the min.cells and min.features arguments of CreateSeuratObject, then cluster with FindClusters and check for rare groups via marker-based scoring. If the data span multiple sections or batches, a method like Harmony can correct batch effects in the embedding without prior subsetting; note that it does not model spatial coordinates itself. Community single-cell best-practice guidelines likewise recommend permissive quality control and warn against overly strict filters that can remove genuine subpopulations.
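One way to check whether a stricter filter would have cost you the rare population is to cluster after lenient QC and then ask, per cluster, what fraction of cells a stricter cutoff would remove. Below is a minimal plain-Python sketch of that diagnostic. The function name, the 500-gene cutoff, and the toy values are hypothetical; in practice the cluster labels would come from your clustering step (e.g. FindClusters in Seurat or `sc.tl.leiden` in Scanpy).

```python
from collections import Counter

def strict_filter_loss(n_genes, clusters, strict_min_genes):
    """Per cluster: (size, fraction of dataset, fraction lost under the strict cutoff)."""
    sizes = Counter(clusters)
    lost = Counter(c for g, c in zip(n_genes, clusters) if g < strict_min_genes)
    total = len(clusters)
    return {c: (n, n / total, lost[c] / n) for c, n in sizes.items()}

# Toy, made-up values: cluster "2" is small and has low gene counts.
n_genes = [1800, 2200, 1500, 2500, 1900, 2100, 1700, 2300, 350, 450]
clusters = ["0"] * 4 + ["1"] * 4 + ["2"] * 2

report = strict_filter_loss(n_genes, clusters, strict_min_genes=500)
for c, (size, frac, lost) in sorted(report.items()):
    print(f"cluster {c}: n={size}, share={frac:.0%}, lost under strict cutoff={lost:.0%}")
```

A small cluster that would be lost almost entirely under the strict cutoff is worth inspecting by marker expression before deciding whether it is debris or a genuine rare population.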

Kevin