Prior to analyzing single-cell RNA-seq datasets, I typically run a pipeline where I look at the QC metrics, apply some hard filter, cluster the data, then re-examine the QC metrics and update the hard filter.
I'm now wondering whether I should just throw out an entire crappy cluster rather than updating the filter, as making the filter more stringent will throw out cells from the "good" clusters.
Which got me thinking... should I even be applying hard filters if computational speed (and personal time) is not limiting? Why not just iteratively cluster and cull?
Curious to know what other people's thoughts and strategies are :) Thanks in advance!
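To make the "iteratively cluster and cull" idea concrete, here is a toy sketch of what I have in mind (pure NumPy; the clustering step is just a stand-in on the QC metric itself, and the 0.2 mitochondrial cutoff is arbitrary — a real pipeline would cluster the expression matrix with something like Leiden/Louvain):

```python
import numpy as np

def cull_loop(mito_frac, cluster_fn, mito_cutoff=0.2, max_iters=10):
    """Iteratively cluster, drop any cluster whose median mitochondrial
    fraction exceeds mito_cutoff, and re-cluster the survivors."""
    keep = np.arange(mito_frac.shape[0])
    for _ in range(max_iters):
        labels = cluster_fn(mito_frac[keep])
        bad = [c for c in np.unique(labels)
               if np.median(mito_frac[keep][labels == c]) > mito_cutoff]
        if not bad:
            break
        keep = keep[~np.isin(labels, bad)]
    return keep

# Toy data: 50 healthy-looking barcodes, 20 with high mito fraction.
rng = np.random.default_rng(0)
mito = np.concatenate([rng.uniform(0.00, 0.05, 50),
                       rng.uniform(0.40, 0.60, 20)])

# Stand-in "clustering" that splits on the metric itself; a real
# workflow would cluster on expression and only then inspect QC.
split = lambda x: (x > 0.2).astype(int)
kept = cull_loop(mito, split)
```

The loop terminates once no cluster looks bad, so no up-front hard filter is ever applied.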
I agree with Ram's points. Computational efficiency is not the reason for filtering; data quality is. To further discuss your strategy, could you define "crappy cluster" and elaborate on your grounds for discarding one?
Given your description, you should only discard a cluster if 100% of its cells fall below your hard filter. Otherwise, you are removing a group of cells that differs from the remaining ones and losing biological meaning. Depending on the clustering algorithm you choose, removing a cluster that contains good-quality cells might exaggerate the differences between the remaining cells.
Also, for the standard QC filter, you want to remove (1) doublets/triplets and empty droplets and (2) potentially cells with high mitochondrial/ribosomal reads. I am unsure why you would need to adjust the filter for (1), because those barcodes should always be removed. In the case of (2), you might have a group of cells with high mitochondrial/ribosomal reads because those cells are under stress or dying, which may be biologically relevant to your experiment.
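As a toy illustration of that standard hard filter (the QC values and cutoffs below are made up purely for demonstration; real thresholds should come from your own distributions):

```python
import numpy as np

# Hypothetical QC table for 6 barcodes: total UMI counts and the
# fraction of counts mapping to mitochondrial genes.
total_counts = np.array([120, 9500, 8800, 45000, 7600, 300])
mito_frac    = np.array([0.08, 0.04, 0.65, 0.05, 0.07, 0.55])

# (1) Empty droplets: very few counts. Doublets: unusually many.
# (2) Stressed/dying cells: high mitochondrial fraction.
keep = (total_counts > 500) & (total_counts < 30000) & (mito_frac < 0.2)
```

Here `keep` flags only the barcodes passing all three checks; everything else is removed before clustering.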
Thanks - I really appreciate the reply. I realize that we set the hard filter for quality as well. I guess in my usual workflow, the "hard" filter is usually not fully determined a priori - I usually only set filters after looking at the distributions anyway. I usually do this for all QC metrics, particularly number of features/cell and % mitochondrial reads. That's why I'm wondering whether it makes more sense to take a completely data-driven approach - let the poor QC clusters declare themselves.
To elaborate - what I've started testing is performing no QC filtering up front, then pre-processing/clustering the data. When I plot total counts/barcode against number of features/barcode, there already appear to be two populations. I think I see a population that should be filtered out, but it's hard to isolate it with simple hard thresholds.
I find that cells with low feature counts or high % mitochondrial reads largely end up in the same clusters if I run a simple clustering workflow without pre-filtering. I do see some overlapping tails between clusters, but biologically I think that makes sense. Not sure if people agree, though, which is why I decided to post here :)
Fundamentally, it comes down to the idea that single-cell experiments profile heterogeneous populations of cells. Why would "poor-quality"/dead cells need to follow the same hard thresholds across cell types? It makes sense that there might be overlap. Since we use the data to guide our thresholding decisions anyway, there is no "absolute truth" to the threshold that separates a good cell from a bad one. So why not let the data decide for us?
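One data-driven middle ground I've been playing with is an adaptive threshold based on median absolute deviations, similar in spirit to `isOutlier` in the scater/scran ecosystem (this is my own minimal re-implementation, not their code, and the 3-MAD default is just the common convention):

```python
import numpy as np

def mad_outlier(x, nmads=3.0, side="higher"):
    """Flag values more than nmads median absolute deviations (MADs)
    from the median, so each dataset sets its own cutoff."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if side == "higher":
        return x > med + nmads * mad
    if side == "lower":
        return x < med - nmads * mad
    return np.abs(x - med) > nmads * mad

# Toy mitochondrial fractions: one barcode is an obvious outlier.
mito = np.array([0.03, 0.05, 0.04, 0.06, 0.05, 0.45])
flags = mad_outlier(mito)
```

This still produces a per-dataset threshold rather than a per-cluster one, but it avoids hand-picking a number.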
What do you think?
AFAIK, clusters group cells by biological similarity, not by data quality. If a cluster contains mostly necrotic cells, you may be able to discard that cluster, but understand that you're doing it for biological reasons.
Throwing out cells from "good" clusters means discarding possibly insignificant GEMs from those clusters, which is not a bad thing. Don't start throwing out whole clusters unless you have an idea of why their cells clustered together.
Thanks - totally agree on all these points. I was thinking of re-examining cluster-level QC metrics to determine whether to throw them out. And yes - this assumes that there are biological features that correlate with barcode quality. So if I have a few cells with "good" metrics that cluster together with barcodes that have obviously bad metrics, those cells are likely bad too. Similarly, borderline barcodes that cluster with "good QC" barcodes may be biologically "good" cells that, for whatever reason, don't quite meet the thresholds I've set.
Does this seem reasonable?
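In code, the cluster-level decision I'm describing would look something like this toy sketch (function name and the 50% cutoff are mine, for illustration only):

```python
import numpy as np

def cluster_level_qc(cell_pass, labels, min_pass_frac=0.5):
    """Decide keep/drop per cluster: a cluster is kept if at least
    min_pass_frac of its cells pass cell-level QC. Borderline cells in
    kept clusters are rescued; 'good' cells in dropped clusters are
    condemned by association."""
    keep = np.zeros_like(cell_pass, dtype=bool)
    for c in np.unique(labels):
        members = labels == c
        if cell_pass[members].mean() >= min_pass_frac:
            keep[members] = True
    return keep

# Toy example: cluster 0 is mostly passing, cluster 1 mostly failing.
labels = np.array([0] * 10 + [1] * 10)
cell_pass = np.array([True] * 9 + [False] + [True] * 2 + [False] * 8)
keep = cluster_level_qc(cell_pass, labels)
```

So the one failing cell in cluster 0 is kept, and the two passing cells in cluster 1 are dropped - exactly the guilt-by-association behavior above.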
I'll wait for others to chime in, but your approach seems a little restrictive to me.