Question

SMARTseq2 scRNAseq and gene length normalization

2.1 years ago

gregoire.destreel ▴ 20

Hi !

I'm wondering whether it is actually correct to compare the expression levels of different genes in scRNAseq data produced with the full-length SMARTseq2 protocol. When you use UMIs (as in the 10x Genomics pipeline), you can estimate the number of mRNA molecules captured for each gene. But with SMARTseq2 you only get reads (no UMIs), so at the same expression level a very long gene is more likely to yield many reads than a very short gene, exactly as in bulk RNAseq. So you should normalize by gene length. However, several normalization methods for scRNAseq (such as scran/scater) don't normalize by gene length. Why is that? Does this preclude comparing the expression levels of different genes within a given data set?

Thanks for your help !

Differential Normalization Expression SMARTseq2 • 988 views

updated 2.1 years ago by dsull ★ 5.8k • written 2.1 years ago by gregoire.destreel ▴ 20

score 2 · Answer 1 · 2022-03-09

In theory, you should correct Smart-seq2 data by dividing the transcript counts by the effective length; otherwise longer transcripts will get higher counts than shorter transcripts for the same amount of expression. UMIs (in theory) take care of this problem, so this procedure should not be used on UMI-based data.
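To make the division by effective length concrete, here is a minimal sketch in plain Python. The function name `length_normalize`, the transcript names, and the numbers are all made up for illustration, and the rescaling to a million (TPM-style) is an assumption on my part, not the exact procedure of any particular tool.

```python
def length_normalize(counts, eff_lengths):
    """Turn raw read counts into TPM-style values (illustrative sketch).

    counts:      dict of transcript -> raw read count
    eff_lengths: dict of transcript -> effective length in bases
    """
    # Reads per base of effective length: this removes the length bias.
    rates = {t: counts[t] / eff_lengths[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        # No reads at all in this cell: return zeros instead of dividing by 0.
        return {t: 0.0 for t in counts}
    # Rescale so the values in one cell sum to one million.
    return {t: rate / total * 1e6 for t, rate in rates.items()}


# Hypothetical example: the long transcript has 10x the reads only
# because it is 10x longer.
counts = {"long_tx": 1000, "short_tx": 100}
eff_lengths = {"long_tx": 10000.0, "short_tx": 1000.0}
tpm = length_normalize(counts, eff_lengths)
```

In this toy case both transcripts end up with the same normalized value, which is the point of the correction. For UMI counts you would skip this step, as said above.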

Many scRNAseq tools are designed for 10X-type (UMI-based) data, which would explain why they don't normalize by gene length.

The default method for running kallisto on Smart-seq2 data normalizes by dividing each transcript's count by its effective length.

(Note: I say "in theory" because real data is messy and the "best" way to normalize hasn't been figured out yet; it's still an active area of research. E.g. I've seen some UMI-based datasets exhibit length bias even though UMIs, in theory, are supposed to correct for that.)