Importing zUMIs output of single-nucleus RNA-seq data into Seurat: introns, exons, or both?
2.5 years ago
psm ▴ 100

Hello all,

I'm new to sequencing analysis, though I have some familiarity with standard analysis pipelines for typical (non-nuclear) RNA-seq datasets, both bulk and single-cell. I now need to analyze a public single-nucleus RNA-seq dataset (from this paper: https://www.pnas.org/content/116/39/19619). The GEO accession provides R objects for each sample, each of which contains the output of a zUMIs pipeline. Each object is a nested list: the top level contains UMI counts and read counts, and each of these contains intron, exon, and intron-exon ("inex") entries.
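For orientation, here is a minimal sketch of how a nested list like this can be explored in R. The toy object below only imitates the layout described above; the element names `umicount`, `inex` and `all` are taken from the code posted further down in this thread, while `readcount`, `exon` and `intron` are assumed names that should be checked against the real object with `names()`.

```r
library(Matrix)

# Toy sparse count matrix (5 genes x 3 cells); purely illustrative data.
toy <- function() {
  rsparsematrix(5, 3, density = 0.4, rand.x = function(n) rpois(n, 2) + 1)
}

# Assumed layout: count type -> feature type -> "all".
feature_types <- c("exon", "intron", "inex")
make_level <- function() {
  lapply(setNames(feature_types, feature_types), function(f) list(all = toy()))
}
raw_counts <- list(umicount = make_level(), readcount = make_level())

# With real data this would instead be something like
#   raw_counts <- readRDS("sample.rds")   # hypothetical file name

# Inspect the top two levels without printing the matrices themselves.
str(raw_counts, max.level = 2)
names(raw_counts)
names(raw_counts$umicount)

# Dimensions (genes, cells) of each UMI matrix, one column per feature type.
dims <- sapply(raw_counts$umicount, function(x) dim(x$all))
dims
```

Comparing `dims` across the three feature types is a quick way to see whether the matrices share the same genes and cells before trying to combine or compare them.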

This may be an obvious question, but I don't even know where to begin. What is the difference between "UMI count" and "read count" in this case? UMI counts are lower overall, so I'm assuming that matrix contains only unique reads... is this correct?

Second, what is the difference between the "intron", "exon", and "inex" lists? I would have imagined that "inex" combines the intron and exon lists, but the counts don't quite add up: intron + exon counts total 48.9 million, whereas inex counts total 47.6 million.

RNA-Seq • 1.5k views
0

Hello all, I am working on the same dataset and facing problems with reading the large UMI counts inex sparse matrix for some samples.

# Reading the umicount intron-exon (inex) counts from the zUMIs output rds files
dge <- as.matrix(raw_counts$umicount$inex$all)
# The as.matrix() call failed due to the large matrix size:
Error in asMethod(object) :    Cholmod error 'problem too large' at file I am


Did you face a similar issue? How did you fix it? I greatly appreciate your help. I am new to scRNA-seq and still learning the techniques for handling sparse matrices. Thanks.
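One possible way around this error, sketched under assumptions, is to skip the dense conversion altogether and keep working with the sparse matrix. The example below uses a small simulated matrix in place of `raw_counts$umicount$inex$all` (the data and dimensions are made up) and relies only on functions from the Matrix package.

```r
library(Matrix)

# Simulated stand-in for raw_counts$umicount$inex$all (genes x cells).
set.seed(1)
counts <- rsparsematrix(2000, 500, density = 0.05,
                        rand.x = function(n) rpois(n, 2) + 1)
dimnames(counts) <- list(paste0("gene", 1:2000), paste0("cell", 1:500))

# Common summaries work directly on the sparse object.
umis_per_cell  <- Matrix::colSums(counts)      # total UMIs per cell
cells_per_gene <- Matrix::rowSums(counts > 0)  # number of cells expressing each gene

# Subsetting also keeps the result sparse.
filtered <- counts[cells_per_gene >= 3, , drop = FALSE]

# Rough memory comparison: sparse object vs. a dense double matrix of the same shape.
sparse_bytes <- as.numeric(object.size(counts))
dense_bytes  <- 8 * prod(dim(counts))
c(sparse = sparse_bytes, dense = dense_bytes)
```

Seurat's `CreateSeuratObject()` accepts a sparse matrix as its `counts` argument, so the `as.matrix()` step should not be needed for import. If a dense matrix is truly required for some step, converting only a subset of genes or cells at a time is one option.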

1
2.5 years ago

If they used UMIs, use the UMI counts, not the read counts. Their methods also say they used unique reads, which implies UMIs, not reads. Might intron-exon refer to reads that span intron-exon junctions?

1
2.5 years ago

UMIs are random strings of bases that are present within the adapters and therefore end up in the sequenced read. The UMI allows you to exclude reads that are likely PCR duplicates, because two independent molecules are not expected, by chance, to carry the same UMI and map to the same position. For scRNA-seq analysis, you usually want to use the UMI counts, as they theoretically give a more accurate picture of relative gene expression.
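To make the distinction concrete, here is a toy illustration in base R of how read counts and UMI counts can differ. It is a deliberately simplified sketch: the reads are invented, and collapsing is done only on exact (cell, gene, UMI) matches, whereas real pipelines may also consider mapping position or near-identical UMIs.

```r
# Six invented reads, each tagged with a cell barcode, a gene, and a UMI.
reads <- data.frame(
  cell = c("c1",  "c1",  "c1",  "c1",  "c2",  "c2"),
  gene = c("g1",  "g1",  "g1",  "g2",  "g1",  "g1"),
  umi  = c("AAC", "AAC", "GTT", "AAC", "CCA", "CCA"),
  stringsAsFactors = FALSE
)

# Read count: every read assigned to a gene in a cell is counted.
read_counts <- aggregate(umi ~ cell + gene, data = reads, FUN = length)

# UMI count: reads sharing cell, gene and UMI are treated as PCR copies
# of one molecule and counted once.
umi_counts <- aggregate(umi ~ cell + gene, data = reads,
                        FUN = function(x) length(unique(x)))

read_counts
umi_counts
```

In this toy example, the three reads for gene g1 in cell c1 collapse to two UMIs, and the two reads in cell c2 collapse to one, which is why UMI totals are lower than read totals.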

They state in the paper only that they identified unique intronic and exonic reads. This leads me to believe that reads spanning an intron-exon junction were placed into that inex list.

0

Would it be fair then to simply add the UMIs for intron, exon, and inex to get a final UMI counts matrix for downstream analysis, if I believe all of those to be informative?

0

I don't think there would be a problem with combining the counts. They all presumably came from legitimate transcripts that were reverse transcribed. Especially considering this is nuclear RNA-seq, you would expect to capture more intronic reads incidentally anyway, since you are sampling from a pool enriched for nascent transcripts.
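Before summing the three matrices, it may be worth checking empirically how "inex" relates to "intron" and "exon", since the thread does not settle what "inex" contains. If inex turns out to already include the intronic and exonic UMIs, adding all three would count the same molecules twice. The sketch below uses small invented matrices in place of the real ones (the assumed access paths, such as `raw_counts$umicount$exon$all`, should be checked against the actual object) and compares exon + intron with inex, both in total and per gene and cell.

```r
library(Matrix)

# Invented toy matrices (4 genes x 3 cells) standing in for the real
# exon, intron and inex UMI count matrices.
genes <- paste0("gene", 1:4)
cells <- paste0("cell", 1:3)
mk <- function(x) Matrix(x, nrow = 4, sparse = TRUE, dimnames = list(genes, cells))

exon   <- mk(c(2, 0, 1, 0,  0, 3, 0, 1,  1, 0, 0, 2))
intron <- mk(c(1, 1, 0, 0,  2, 0, 0, 1,  0, 0, 4, 1))
inex   <- mk(c(3, 1, 1, 0,  2, 3, 0, 1,  1, 0, 4, 2))

# Restrict to genes and cells present in all three matrices, in case the
# real matrices do not share identical dimnames.
g  <- Reduce(intersect, list(rownames(exon), rownames(intron), rownames(inex)))
cl <- Reduce(intersect, list(colnames(exon), colnames(intron), colnames(inex)))

summed <- exon[g, cl] + intron[g, cl]
excess <- summed - inex[g, cl]   # by how much exon + intron exceeds inex

# Totals, analogous to the 48.9 million vs 47.6 million comparison above.
c(exon_plus_intron = sum(summed), inex = sum(inex[g, cl]), excess = sum(excess))

# Is inex ever larger than exon + intron for a given gene and cell?
any(excess < 0)
```

If, on the real data, inex is never larger than exon + intron and is at least as large as either one alone, that pattern would be consistent with inex being a combined count rather than a separate junction-only category. In that case inex alone might be the matrix to use, but that is something to confirm against the zUMIs documentation rather than take from this sketch.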
