Yes, subsetting cells based on canonical marker genes before applying quality control, normalization, and clustering is a feasible approach, but it is not universally recommended as a first-line strategy. This method can help retain rare populations that might otherwise be filtered out by global thresholds, such as minimum gene counts or unique molecular identifier totals, which often assume uniform cell quality across the dataset. In spatial transcriptomics, where spots may capture multiple cells or low-abundance transcripts, this targeted subsetting aligns with practices for handling heterogeneous tissues.
However, potential pitfalls include introducing selection bias by relying on predefined markers, which may exclude subpopulations with variable or low marker expression. This can lead to incomplete representation of the rare population and inflate false positives in downstream analyses, such as differential expression or pathway enrichment. Additionally, performing quality control only on the subset risks overlooking technical artifacts, like ambient RNA contamination or spatial batch effects, that affect the entire dataset.
A better practice is to apply lenient global quality control first to minimize cell loss, then run dimensionality reduction and clustering on the full dataset so that rare clusters emerge from the data rather than from predefined markers. Tools like Seurat or Scanpy support this workflow. For example, in Seurat you can set loose min.cells and min.features thresholds in CreateSeuratObject, cluster with FindClusters, and monitor for rare groups via marker-based scoring. If the data span several batches or tissue sections, a method like Harmony can handle batch correction on the full dataset without prior subsetting. Literature supporting permissive quality control includes guidelines from the single-cell best practices consortium, which emphasize avoiding overly strict filters in order to retain subpopulations.
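To make the "lenient filter first, then score markers on everything" idea concrete, here is a minimal sketch in plain Python. It is not Seurat or Scanpy code: the gene names (M1, M2, C1, C2), the toy counts, the min_genes cutoff, and the helper functions are all hypothetical, and the score is a simplified "marker mean minus control mean" on log-transformed counts rather than any library's exact scoring method.

```python
from math import log1p
from statistics import mean

# Hypothetical marker and control gene names, for illustration only.
MARKERS = ["M1", "M2"]
CONTROLS = ["C1", "C2"]

# Toy per-cell raw counts (made up).
cells = [
    {"id": "c1", "counts": {"M1": 5, "M2": 3, "C1": 1, "C2": 0}},
    {"id": "c2", "counts": {"M1": 0, "M2": 0, "C1": 4, "C2": 6}},
    {"id": "c3", "counts": {"M1": 0, "M2": 0, "C1": 1, "C2": 0}},
    {"id": "c4", "counts": {"M1": 1, "M2": 0, "C1": 5, "C2": 5}},
]


def n_detected(counts):
    """Number of genes with at least one count in this cell."""
    return sum(1 for v in counts.values() if v > 0)


def lenient_filter(cells, min_genes=2):
    """Drop only near-empty cells; the cutoff is deliberately permissive."""
    return [c for c in cells if n_detected(c["counts"]) >= min_genes]


def marker_score(counts, markers, controls):
    """Simplified score: mean log1p marker expression minus mean log1p control expression."""
    marker_mean = mean([log1p(counts.get(g, 0)) for g in markers])
    control_mean = mean([log1p(counts.get(g, 0)) for g in controls])
    return marker_mean - control_mean


kept = lenient_filter(cells)
scores = {c["id"]: marker_score(c["counts"], MARKERS, CONTROLS) for c in kept}

# Cells scoring above zero are only flagged for follow-up, not subset away.
candidates = [cell_id for cell_id, s in scores.items() if s > 0]
print(candidates)
```

The point of the design is that the marker score is used to monitor where candidate rare cells end up after clustering, while every cell that passes the permissive filter stays in the dataset.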
Kevin
Yes, absolutely. It is even (to me) a good practice to do (automated) crude cell type assignment first, because QC metrics such as the number of detected genes can vary wildly between cell types. These initial QCs are just a very crude prefiltering to remove the obvious trash, so if you get some crude cell type separation first and then remove the big trash per cell type, I don't see how this would introduce bias.
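A rough illustration of that per-cell-type prefiltering, as a minimal sketch in plain Python. Everything here is assumed for the example: the cell type labels, the toy n_genes values, the global cutoff of 500, and the choice of a median minus 3 MAD lower bound as the outlier rule. In a real analysis the same logic would be applied to the per-cell QC table of whatever toolkit you use.

```python
from statistics import median

# Toy data (made up): type "A" has high gene counts, type "B" is a
# low-RNA population. Each group contains one obvious low-quality cell.
cells = (
    [{"cell_type": "A", "n_genes": n} for n in (2000, 2100, 1900, 2050, 1950, 300)]
    + [{"cell_type": "B", "n_genes": n} for n in (400, 420, 380, 410, 390, 50)]
)


def mad_lower_bound(values, n_mads=3.0):
    """Lower cutoff: median minus n_mads times the median absolute deviation."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med - n_mads * mad


def filter_per_type(cells, metric="n_genes", n_mads=3.0):
    """Keep cells that are not low outliers relative to their own crude cell type."""
    by_type = {}
    for cell in cells:
        by_type.setdefault(cell["cell_type"], []).append(cell[metric])
    bounds = {t: mad_lower_bound(vals, n_mads) for t, vals in by_type.items()}
    return [c for c in cells if c[metric] >= bounds[c["cell_type"]]]


# A single global threshold (hypothetical value) removes all of type B.
global_kept = [c for c in cells if c["n_genes"] >= 500]

# Per-type cutoffs remove only the outlier within each type.
per_type_kept = filter_per_type(cells)

print(sum(c["cell_type"] == "B" for c in global_kept))
print(sum(c["cell_type"] == "B" for c in per_type_kept))
```

With these toy numbers the global cutoff keeps no type B cells at all, while the per-type cutoff keeps the five reasonable type B cells and still drops the one clear outlier in each group.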
Hi ATpoint, thank you very much for the helpful insight! I'm still doing some literature reading on this myself, but have you come across any papers or workflows that address this same issue or that apply a similar workflow?