Unexpected spliced/unspliced percentages in scRNA-seq data
9 weeks ago
Madiha ▴ 10

Hi all,

I am analyzing a PBMC scRNA-seq dataset. While running a velocity analysis with velocyto, I got a majority of unspliced counts (60%), a minority of spliced (35%), and 5% ambiguous, and this is almost consistent across all cell types. This seems way off from the 15-25% of unspliced intronic reads reported in La Manno et al. 2018. Also, Tomàs & Anna 2024 discussed how very low or very high fractions of intronic reads can indicate empty droplets or lysed nuclei without cytosol, respectively, so such low-quality data needs to be filtered. I wondered whether these statistics indicate something wrong with my analysis method, or whether I should take them at face value and trust my downstream results.

Also, is there a Python tool like DropletQC (R) for fetching the intronic content of each cell? I want to validate the velocyto results.
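One way to get a per-cell intronic fraction in Python is to count region tags straight from the Cell Ranger BAM, which is, as far as I understand, similar in spirit to DropletQC's tag-based mode. Below is a minimal sketch, assuming the BAM carries Cell Ranger's `CB` (corrected barcode) and `RE` (region type: `E` exonic, `N` intronic, `I` intergenic) tags. The function name `nuclear_fraction`, the toy barcodes, and the pysam loop in the comment are illustrative and not taken from any existing package.

```python
from collections import Counter, defaultdict


def nuclear_fraction(tagged_reads):
    """Return {barcode: intronic / (intronic + exonic)}.

    tagged_reads: iterable of (barcode, region) pairs, where region is
    'E' (exonic), 'N' (intronic) or 'I' (intergenic, ignored here).
    """
    counts = defaultdict(Counter)
    for barcode, region in tagged_reads:
        if region in ("E", "N"):
            counts[barcode][region] += 1
    return {bc: c["N"] / (c["N"] + c["E"]) for bc, c in counts.items()}


# Hypothetical way to feed it from a BAM with pysam (not run here):
#   import pysam
#   def iter_tags(path):
#       with pysam.AlignmentFile(path, "rb") as bam:
#           for read in bam:
#               if read.has_tag("CB") and read.has_tag("RE"):
#                   yield read.get_tag("CB"), read.get_tag("RE")
#   fractions = nuclear_fraction(iter_tags("possorted_genome_bam.bam"))

# Toy input: two made-up barcodes.
toy = [
    ("AAAC-1", "N"), ("AAAC-1", "N"), ("AAAC-1", "E"), ("AAAC-1", "I"),
    ("TTTG-1", "E"), ("TTTG-1", "E"),
]
fractions = nuclear_fraction(toy)
print(fractions)
```

Note that this counts reads, not UMIs, so the numbers will not match velocyto's molecule-level percentages exactly, but a cell population that is 60% unspliced in velocyto should still show a clearly elevated intronic fraction here.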

Thanks so much for the help!

Madiha

seq Spliced unspliced analysis velocity data scRNA • 695 views

It’s probable you just have high nucleus content capture. I observed this in one of my preps recently (using a plate-based method, not droplet).

Why not just take your data and do some further analysis (look at cell types and clustering)? I think that will be informative.

As for RNA velocity, it has always been a bit of a non-rigorous method despite its widespread use, and such models will probably perform worse with nuclei than cells. Go off your cell type assignments to see if your data is usable/useful.


Thinking aloud, wouldn't you be more surprised about the 5% ambiguous here? After all, one can only be certain that a read is spliced if it spans an exon-exon boundary. An exon-only read is ambiguous because it can come from either pre-mRNA or mature RNA, while an intronic read either lies within an intron or crosses an intron-exon boundary. That would mean many more reads span exon-exon boundaries than fall within a single exon; is this not odd? As said, thinking aloud because I have no reference data at hand.


That's not how cellranger->velocyto determines ambiguity. Exon-only is considered spliced by their definition. An ambiguous read by their definition is a read that maps to both an intron of one transcript of the gene and an exon of another transcript of that same gene (there are genomic regions that can be either exonic or intronic depending on the isoform).
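To make that definition concrete, here is a toy sketch of the rule as described above: exon-only in every isoform counts as spliced, intronic in every isoform counts as unspliced, and a mix of the two counts as ambiguous. This is not velocyto's actual code; it ignores gapped (spliced) alignments, strand, UMIs, and velocyto's logic options. The function `classify_read` and the two made-up isoforms are purely illustrative.

```python
def classify_read(read, transcripts):
    """Classify one contiguous read against the isoforms of a single gene.

    read: (start, end), half-open coordinates.
    transcripts: list of isoforms, each a sorted list of (start, end) exons.
    """
    seen = set()
    for exons in transcripts:
        tx_start, tx_end = exons[0][0], exons[-1][1]
        if read[0] < tx_start or read[1] > tx_end:
            continue  # read is not contained in this isoform's span
        inside_exon = any(s <= read[0] and read[1] <= e for s, e in exons)
        seen.add("exon" if inside_exon else "intron")
    if not seen:
        return "unassigned"
    if seen == {"exon"}:
        return "spliced"      # exon-only in every overlapping isoform
    if seen == {"intron"}:
        return "unspliced"    # touches an intron in every overlapping isoform
    return "ambiguous"        # exonic in one isoform, intronic in another


# Two made-up isoforms; the region 150-200 is intronic in A but exonic in B.
isoform_a = [(0, 100), (200, 300)]
isoform_b = [(0, 100), (150, 300)]
gene = [isoform_a, isoform_b]

for read in [(10, 60), (110, 140), (90, 120), (160, 190)]:
    print(read, classify_read(read, gene))
```

Only the last read, which sits in the isoform-dependent region, comes out as ambiguous, which is why this category stays small under this definition.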

The kallisto/bustools 'nac' index and the alevin-fry 'spliceu' index more correctly define exon-only as being ambiguous -- in which case, you'll see 30-60% being ambiguous. As most tools that jointly model the unspliced/spliced (or nascent/mature) RNA quantifications discard the ambiguous count matrix completely, the most common heuristic is to simply assign those ambiguous molecules to the 'spliced' species (which may, of course, not be the correct thing to do in this situation).
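As a small illustration of that heuristic, the sketch below folds the ambiguous counts into the spliced layer and recomputes the percentages. The helper `layer_percentages` is hypothetical, and the inputs are simply the percentages from the question used as stand-in counts.

```python
def layer_percentages(spliced, unspliced, ambiguous, fold_ambiguous=False):
    """Return the percentage of counts in each layer.

    With fold_ambiguous=True, ambiguous counts are added to 'spliced',
    mimicking the common heuristic (which may not be the right call).
    """
    if fold_ambiguous:
        spliced, ambiguous = spliced + ambiguous, 0
    total = spliced + unspliced + ambiguous
    if total == 0:
        raise ValueError("no counts to summarise")
    layers = {"spliced": spliced, "unspliced": unspliced, "ambiguous": ambiguous}
    return {name: 100 * value / total for name, value in layers.items()}


# Stand-in counts taken from the percentages reported in the question.
as_reported = layer_percentages(35, 60, 5)
folded = layer_percentages(35, 60, 5, fold_ambiguous=True)
print(as_reported)
print(folded)
```

With velocyto's narrow definition of ambiguous the fold barely moves the numbers here, whereas with a 'nac' or 'spliceu' style index, where 30-60% can be ambiguous, the same choice would shift the spliced fraction a lot.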


Yes, 30-60% ambiguous is what I'd have expected. In their defense, velocyto is old and from "the early days", and it seems we should abandon it entirely.


Thanks for your insight. We have done other analyses, and the annotation and clustering seem fine. We did the velocity analysis last, and based on the results above we are now wondering whether the other analyses are correct or need additional filtering.

You are right about velocity analysis, but I also checked the expression of MALAT1, which is considered a proxy for the intronic fraction, and all my cells show very high MALAT1 expression.


OK, then that's good. I wouldn't trust your velocity analysis (for reasons I mentioned in another comment), but your data and data processing seem fine. You can try the analysis anyway and see if it makes sense (there are hundreds of PBMC datasets out there that you can compare to).


Is this based on CellRanger alignments?


Yes, it's a CellRanger alignment.

9 weeks ago
predeus ★ 2.1k

This could be because of the read length used in the experiment. Depending on the individual read length and on paired- or single-end mapping, these values can change quite a bit.

What exact protocol is the scRNA-seq dataset?
