I have single-cell Smart-seq3 data. The zUMIs pipeline produces about 19 different output count matrices:
umicount
  exon: all, downsampled
  inex: all, downsampled
  intron: all, downsampled
readcount
  exon: all, downsampled
  inex: all, downsampled
  intron: all, downsampled
readcount_internal
  exon: all, downsampled
  inex: all, downsampled
  intron: all, downsampled
rpkm
  exon: all
Read counts from exons are the traditional RNA-seq counts that one would expect. UMI counts from exons are the equivalent corrected for PCR duplication, so they can never be higher than the corresponding read counts and are usually lower. The zUMIs paper claims that using intron+exon counts improves clustering.
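To make my understanding of the read/UMI distinction concrete, here is a minimal Python sketch. It is not how zUMIs does it internally: the toy records, the function name `count_reads_and_umis`, and the exact-match collapsing of UMIs are all my own assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: one (cell, gene, UMI) record per exon-mapped read.
reads = [
    ("cell1", "GeneA", "AAAC"),
    ("cell1", "GeneA", "AAAC"),  # same UMI again: counted as a PCR duplicate
    ("cell1", "GeneA", "GGTT"),
    ("cell1", "GeneB", "CCGA"),
    ("cell2", "GeneA", "AAAC"),
    ("cell2", "GeneA", "AAAC"),
    ("cell2", "GeneA", "AAAC"),
]


def count_reads_and_umis(records):
    """Return (read_counts, umi_counts), both keyed by (cell, gene).

    Assumption: UMIs are collapsed by exact sequence match only; no
    error correction of near-identical UMIs is attempted here.
    """
    read_counts = defaultdict(int)
    umi_sets = defaultdict(set)
    for cell, gene, umi in records:
        read_counts[(cell, gene)] += 1   # every read counts once
        umi_sets[(cell, gene)].add(umi)  # each distinct UMI counts once
    umi_counts = {key: len(umis) for key, umis in umi_sets.items()}
    return dict(read_counts), umi_counts


read_counts, umi_counts = count_reads_and_umis(reads)
for key in sorted(read_counts):
    print(key, "reads:", read_counts[key], "UMIs:", umi_counts[key])
```

In this toy case, GeneA in cell2 has three reads but only one UMI, which is the kind of gap between the readcount and umicount matrices that I am describing above.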
I am interested in the views of experts. What are downsampled counts? What is readcount_internal? What are the implications of using intron+exon counts? Should rpkm be used at all? Which dataset should be used for differential gene expression? What are the pros and cons of the different types of data? What considerations should one keep in mind?