Hi all,
I am analyzing a PBMC scRNA-seq dataset. While running RNA velocity analysis with velocyto, I got mostly unspliced counts (60%), with a minority spliced (35%) and ambiguous (5%), and these proportions are nearly the same across all cell types. This seems far from the 15-25% unspliced intronic reads reported in La Manno et al. 2018. Also, Tomàs & Anna 2024 discussed how very low or very high fractions of intronic reads can indicate empty droplets or lysed nuclei without cytosol, respectively, so such low-quality data needs to be filtered. I wonder whether these statistics indicate something wrong with my analysis method, or whether I should take them at face value and trust my downstream results.
Also, is there a Python tool similar to the R package DropletQC for getting the intronic content of each cell? I want to validate the velocyto results.
Thanks so much for the help!
Madiha
You probably just captured a high proportion of nuclear content. I observed this in one of my preps recently (using a plate-based method, not droplet).
Why not just take your data and do some further analysis (look at cell types and clustering)? I think that will be informative.
As for RNA velocity, it has always been a bit of a non-rigorous method despite its widespread use, and such models will probably perform worse with nuclei than cells. Go off your cell type assignments to see if your data is usable/useful.
Thinking aloud, wouldn't you be more surprised about the 5% ambiguous here? After all, one can only be certain that a read is spliced if it spans an exon-exon boundary. An exon-only read is ambiguous because it can come from either pre-mRNA or mature RNA, while an intronic read lies either within an intron or across an intron-exon boundary. With only 5% ambiguous, that would mean many more reads span exon-exon boundaries than fall within a single exon. Is that not odd? As said, thinking aloud because I have no reference data at hand.
That's not how the cellranger -> velocyto pipeline determines ambiguity. Exon-only is considered spliced by their definition. An ambiguous read by their definition is a read that maps to both an intron of one transcript of the gene and an exon of another transcript of that same gene (there are genomic regions that can be either exonic or intronic depending on the isoform).
The kallisto/bustools 'nac' index and the alevin-fry 'spliceu' index more correctly define exon-only as being ambiguous -- in which case, you'll see 30-60% being ambiguous. As most tools that jointly model the unspliced/spliced (or nascent/mature) RNA quantifications discard the ambiguous count matrix completely, the most common heuristic is to simply assign those ambiguous molecules to the 'spliced' species (which may, of course, not be the correct thing to do in this situation).
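To make the difference between the two conventions concrete, here is a toy sketch in Python. It is only a simplification of what is described above, not the actual logic of velocyto, kallisto/bustools or alevin-fry. The function names, the per-isoform labels ("exonic", "intronic", "junction") and the convention names are all made up for illustration.

```python
def classify(per_isoform_labels, convention):
    """Toy classification of one read (hypothetical helper, not any tool's API).

    per_isoform_labels: one label per isoform of the gene the read overlaps:
      "exonic"   - read lies entirely within an exon of that isoform
      "intronic" - read touches an intron (or an exon-intron boundary)
      "junction" - read is split across an exon-exon junction
    convention: "exon_is_spliced" (velocyto-like, as described above) or
                "exon_is_ambiguous" (nac/spliceu-like, as described above)
    """
    labels = set(per_isoform_labels)
    if not labels:
        raise ValueError("read overlaps no isoform")
    if labels == {"intronic"}:
        return "unspliced"
    if convention == "exon_is_spliced":
        # exon in one isoform but intron in another -> ambiguous
        return "ambiguous" if "intronic" in labels else "spliced"
    if convention == "exon_is_ambiguous":
        # only junction evidence proves the molecule was spliced
        if "junction" in labels and "intronic" not in labels:
            return "spliced"
        return "ambiguous"
    raise ValueError(f"unknown convention: {convention}")


def fold_ambiguous_into_spliced(spliced, unspliced, ambiguous):
    """The common heuristic mentioned above: count ambiguous as spliced."""
    return spliced + ambiguous, unspliced


# An exon-only read flips category depending on the convention.
exon_only = ["exonic"]
velocyto_like = classify(exon_only, "exon_is_spliced")
nac_like = classify(exon_only, "exon_is_ambiguous")

# Folding the percentages from the original post (35 / 60 / 5).
folded = fold_ambiguous_into_spliced(35, 60, 5)
```

With the numbers from the original post, folding gives 40% spliced and 60% unspliced, so the heuristic barely moves the balance when only 5% is ambiguous.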
Yes, 30-60% ambiguous is what I'd have expected. In their defense, velocyto is old and from "the early days", and it seems we should abandon it entirely.
Thanks for your insight. We have done other analyses, and annotation and clustering seem fine. We ran the velocity analysis last, and based on the results above we are now unsure whether the other analyses are correct or need additional filtering.
You are right about velocity analysis, but I also checked the expression of MALAT1, which is considered a proxy for the intronic fraction, and all my cells show very high MALAT1 expression.
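For what it's worth, "very high expression" is easier to compare between cells or datasets as a fraction of each cell's total counts. A minimal sketch, assuming the counts are available as a plain dict of per-cell gene counts (the function name, data layout and barcodes are hypothetical; with a real count matrix you would do the same thing on the matrix directly):

```python
def gene_fraction(counts_per_cell, gene="MALAT1"):
    """Fraction of each cell's total counts that comes from `gene`.

    counts_per_cell: {cell_barcode: {gene_name: count}} (assumed layout).
    Cells with zero total counts get 0.0 instead of a division error.
    """
    fractions = {}
    for cell, counts in counts_per_cell.items():
        total = sum(counts.values())
        fractions[cell] = counts.get(gene, 0) / total if total else 0.0
    return fractions


# Made-up toy counts, for illustration only.
toy = {
    "cell_a": {"MALAT1": 30, "CD3E": 10, "ACTB": 60},
    "cell_b": {"CD3E": 5, "ACTB": 15},
}
malat1 = gene_fraction(toy)
```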
OK, then that's good. I wouldn't trust your velocity analysis (for reasons I mentioned in another comment), but your data and data processing seem fine. You can try the analysis anyway and see if it makes sense (there are hundreds of PBMC datasets out there that you can compare to).
Is this based on CellRanger alignments?
Yes, it's a Cell Ranger alignment.
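On the Python question from the original post: since this is a Cell Ranger BAM, each aligned read should carry a CB tag (corrected cell barcode) and an RE tag (region type: E = exonic, N = intronic, I = intergenic). As far as I understand, DropletQC's tag-based nuclear fraction relies on the same tags. Below is a minimal sketch of that calculation, not a port of DropletQC. The function name is made up, and it takes plain (barcode, region) pairs so it runs without a BAM; in practice you could produce those pairs with something like pysam by reading the CB and RE tags of each read (that part is an assumption and not shown).

```python
from collections import defaultdict


def nuclear_fraction(tagged_reads):
    """Per-cell intronic fraction: intronic / (intronic + exonic).

    tagged_reads: iterable of (barcode, region) pairs, where region uses
    Cell Ranger's RE codes: "E" exonic, "N" intronic, "I" intergenic.
    Intergenic reads are ignored; cells with no E/N reads are omitted.
    """
    counts = defaultdict(lambda: {"E": 0, "N": 0})
    for barcode, region in tagged_reads:
        if region in ("E", "N"):
            counts[barcode][region] += 1
    return {
        barcode: c["N"] / (c["E"] + c["N"])
        for barcode, c in counts.items()
    }


# Made-up reads, for illustration only.
reads = [
    ("AAAC-1", "E"), ("AAAC-1", "N"), ("AAAC-1", "N"), ("AAAC-1", "I"),
    ("TTTG-1", "E"),
    ("GGGA-1", "I"),  # only intergenic, so this barcode is dropped
]
fractions = nuclear_fraction(reads)
```

Note this counts reads rather than UMIs, so it will not match velocyto's molecule counts exactly, but a per-cell fraction near 60% from this would at least suggest the velocyto numbers reflect the data and not a counting artifact.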