Pre-processing for single cell RNAseq: Hard thresholds, data (cluster)-driven or both?
psm ▴ 100 · 12 days ago

Prior to analyzing single cell RNAseq datasets, I typically employ a pipeline where I look at QC metrics, apply some hard filter, cluster the data, then re-examine the QC metrics, and update the hard filter.
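
For concreteness, the hard-filter step in my pipeline looks roughly like the sketch below (Scanpy here; the path, the "MT-" prefix, and the cutoffs are illustrative placeholders, not recommendations):

```python
import scanpy as sc

# Load a CellRanger-style filtered matrix (path is a placeholder)
adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")

# Flag mitochondrial genes (human "MT-" prefix assumed) and compute per-cell QC metrics
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Hard filters; cutoffs are read off the QC distributions, not fixed a priori
adata = adata[adata.obs["n_genes_by_counts"] >= 200].copy()
adata = adata[adata.obs["pct_counts_mt"] <= 20].copy()
```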

I'm now wondering whether I should just throw out an entire crappy cluster rather than updating the filter, as making the filter more stringent will throw out cells from the "good" clusters.

Which got me thinking... should I even be applying hard filters if computational speed (and personal time) is not limiting? Why not just iteratively cluster and cull?
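
Something like the loop below is what I have in mind - cluster, score each cluster's QC, cull, repeat. This continues from the `adata` above; the three rounds and the 20% cutoff are arbitrary choices for illustration:

```python
import scanpy as sc

adata.layers["counts"] = adata.X.copy()  # keep raw counts for re-processing

for _ in range(3):  # fixed number of rounds, purely for illustration
    adata.X = adata.layers["counts"].copy()  # restart from raw counts each round
    adata.uns.pop("log1p", None)             # reset log-transform bookkeeping
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.pca(adata, n_comps=30)
    sc.pp.neighbors(adata)
    sc.tl.leiden(adata)

    # Cull clusters whose median mito fraction looks degenerate
    med_mt = adata.obs.groupby("leiden")["pct_counts_mt"].median()
    keep = med_mt[med_mt < 20].index
    adata = adata[adata.obs["leiden"].isin(keep)].copy()
```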

Curious to know what other people's thoughts and strategies are :) Thanks in advance!

scRNA-seq • 493 views

I agree with Ram's points. Computational efficiency is not the reason for filtering; data quality is. To discuss your strategy further, could you define "crappy cluster" and explain your grounds for discarding one?

Given your description, you should only discard a cluster if 100% of its cells fall below your hard filter. Otherwise, you are removing a group of cells that differs from the rest and losing biological signal. Depending on the clustering algorithm you choose, removing a cluster that still contains good-quality cells might exaggerate the differences among the remaining cells.

Also, for the standard QC filter, you want to remove (1) doublets/triplets and empty droplets and (2) potentially cells with high mitochondrial/ribosomal reads. I am unsure why you would need to adjust the filter for (1), because those barcodes should simply be removed. In the case of (2), you might have a group of cells with high mitochondrial/ribosomal reads because those cells are under stress or dying, which may be biologically relevant to your experiment.
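
For (1), for example, doublets can be called and dropped outright before any clustering. A minimal sketch using Scanpy's Scrublet wrapper (one caller among several), assuming raw counts and the QC metrics computed as in your workflow:

```python
import scanpy as sc

# (1) Call doublets on raw counts and remove them unconditionally
sc.external.pp.scrublet(adata)  # adds obs["doublet_score"] and obs["predicted_doublet"]
adata = adata[~adata.obs["predicted_doublet"]].copy()

# (2) Flag, rather than silently drop, high-mito cells so the "stressed/dying
# vs. biologically interesting" question can still be asked later
adata.obs["high_mt"] = adata.obs["pct_counts_mt"] > 20  # placeholder cutoff
```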


Thanks - I really appreciate the reply. I realize that we set the hard filter for quality as well. I guess in my usual workflow the "hard" filter is not fully determined a priori - I only set filters after looking at the distributions anyway. I do this for all QC metrics, particularly the number of features/cell and % mitochondrial reads. That's why I'm wondering whether it makes more sense to take a completely data-driven approach - let the poor-QC clusters declare themselves.

To elaborate - what I've started testing is performing no QC filtering up front and then pre-processing/clustering the data. When I plot total counts/barcode against number of features/barcode, there already appear to be two populations. I think I can see a population that should be filtered out, but it's hard to capture with simple hard thresholds.

[Figure: number of features/cell vs. total counts/cell]

I find that barcodes with low feature counts or high % mitochondrial reads largely end up clustering together if I run a simple clustering workflow without pre-filtering. I do see some overlapping tails between clusters, but biologically I think that makes sense. I'm not sure whether people agree, which is why I decided to post here :)

Fundamentally, it comes down to the idea that single cell experiments profile heterogeneous populations of cells. Why would "poor quality"/dead cells obey the same hard thresholds across cell types? It makes sense that there would be overlap. Since we use the data to guide our thresholding decisions anyway, there is no "absolute truth" to the threshold that separates a good cell from a bad one. So why not let the data decide for us?
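
For what it's worth, one data-driven middle ground is adaptive thresholds at k median absolute deviations (MADs) from the median, in the spirit of scater's isOutlier. A sketch, with illustrative k values, assuming the QC columns from sc.pp.calculate_qc_metrics:

```python
import numpy as np

def is_outlier(values, k):
    """True where a value lies more than k MADs from the median."""
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    return np.abs(values - med) > k * mad

adata.obs["qc_outlier"] = (
    is_outlier(np.log1p(adata.obs["total_counts"]), k=5)
    | is_outlier(np.log1p(adata.obs["n_genes_by_counts"]), k=5)
    | is_outlier(adata.obs["pct_counts_mt"], k=3)
)
adata = adata[~adata.obs["qc_outlier"]].copy()
```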

What do you think?

[Figures: nFeatures per cluster and nCounts per cluster after clustering]


AFAIK, clusters group cells by biological similarity, not technical quality. If a cluster were to contain mostly necrotic cells, you may be able to discard that cluster, but understand that you're doing it for biological reasons.

Throwing out cells from "good" clusters = ignoring possibly insignificant GEMs from those clusters, which is not a bad thing. Don't start throwing out whole clusters unless you have an idea of why those cells clustered together.


Thanks - totally agree on all these points. I was thinking of re-examining cluster-level QC metrics to decide whether to throw clusters out. And yes - this assumes there are biological features that correlate with barcode quality: if a few cells with "good" metrics cluster alongside barcodes with obviously bad metrics, they are likely bad too. Similarly, borderline barcodes that cluster with "good-QC" barcodes may be biologically "good" cells that, for whatever reason, don't quite meet the thresholds I've set.
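
Concretely, something like this per-cluster summary is what I mean (assuming Leiden labels and QC columns as in the earlier sketches; the cutoffs are placeholders to be read off the actual distributions):

```python
# Median QC per cluster; clusters failing on either metric get dropped
summary = adata.obs.groupby("leiden")[["n_genes_by_counts", "pct_counts_mt"]].median()
bad = summary[(summary["n_genes_by_counts"] < 500) | (summary["pct_counts_mt"] > 25)].index
adata = adata[~adata.obs["leiden"].isin(bad)].copy()
```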

Does this seem reasonable?


I'll wait for others to chime in, but your approach seems a little restrictive to me.
