I've been analyzing single-cell RNA-seq datasets recently. Once I have normalized count matrices for the different cell types (after filtering out doublets), I face a question that I can't answer:
What is a good criterion for deciding that a transcript is expressed (and subsequently filtering on it)? I ask because many of my transcripts have only 1, 2, or 3 counts, and I wonder whether I should consider them expressed. This is droplet-based scRNA-seq (10x).
I know that single-cell data is sparse, and I have heard that some people consider a transcript expressed even with a single count, but I am wondering whether there is an analytical way to approach this.
- Maybe look at the distribution of counts across all transcripts and select a percentage-based threshold?
- Use a hard filter, such as "minimum 5 reads"?
- Perhaps it depends on the specific dataset? Some cell types should show more overall expression than others.
- If a transcript has only 1 count per cell but is detected in most of the cells (i.e. above some percentage) of its cell type, can it be considered expressed at very low levels?
Those are some ideas I've thought of, but this is killing me: I need to arrive at a decision/criterion to proceed with my analysis.
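To make the last idea in the list concrete, here is a minimal sketch of what I have in mind. It assumes a dense cells-by-genes matrix of raw UMI counts (not the normalized values) and one cell-type label per cell. The function names and the default values of `min_count` and `min_fraction` are hypothetical placeholders, not recommended settings.

```python
import numpy as np


def detection_rate(counts, min_count=1):
    """Fraction of cells in which each gene has at least `min_count` counts.

    counts: 2-D array, rows = cells, columns = genes (assumed raw UMI counts).
    """
    counts = np.asarray(counts)
    return (counts >= min_count).mean(axis=0)


def expressed_by_cell_type(counts, cell_types, min_count=1, min_fraction=0.1):
    """Return {cell_type: boolean mask over genes}.

    A gene is called "expressed" in a cell type if it reaches `min_count`
    in at least `min_fraction` of that type's cells. Both thresholds are
    arbitrary here and would need tuning per dataset.
    """
    counts = np.asarray(counts)
    cell_types = np.asarray(cell_types)
    masks = {}
    for cell_type in np.unique(cell_types):
        subset = counts[cell_types == cell_type]
        masks[str(cell_type)] = detection_rate(subset, min_count) >= min_fraction
    return masks


# Toy data: 6 cells (3 of type A, 3 of type B) x 3 genes.
toy_counts = np.array([
    [1, 0, 5],
    [1, 0, 0],
    [1, 0, 7],
    [0, 3, 0],
    [0, 0, 0],
    [0, 0, 1],
])
toy_types = np.array(["A", "A", "A", "B", "B", "B"])

masks = expressed_by_cell_type(toy_counts, toy_types, min_count=1, min_fraction=0.5)
for cell_type, mask in masks.items():
    print(cell_type, mask.tolist())
```

In this toy example the first gene has only 1 count per cell but is detected in every type-A cell, so it passes the filter for type A, whereas the second gene has 3 counts in a single type-B cell and does not pass. Whether that is the right behavior is exactly what I am unsure about.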
Hopefully, someone knows more about this problem than I do.