Recommended way to normalize SmartSeq2 gene expression matrix to better match 10X expression data
10 days ago
Cookin ▴ 10


I have single-cell RNA sequencing data from similar tissue: one dataset collected with SmartSeq2 (full-length transcripts, no UMIs) and another collected with 10X (3' end, with UMIs).

I am doing a standard log(x/n + 1) normalization for the 10X data, where x is a gene's count in a cell and n is that cell's total count. However, I am unsure how to normalize the SmartSeq2 data. Should I correct for gene-length bias? When I try log(x/n + 1) on the SmartSeq2 data, I get large differences in gene expression between 10X and SmartSeq2.

My goal is to integrate the 10X and SmartSeq2 datasets and perform clustering. I'd like the two datasets to match as closely as possible before integration. I have a count matrix for each (rows are genes, columns are cells).
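For concreteness, here is a minimal Python/NumPy sketch of the two normalizations being discussed: the plain log(x/n + 1) transform, and a length-corrected variant that divides each gene's counts by its length before scaling. The function names, the `scale` factor of 1e4, and the `gene_lengths_bp` input are assumptions made for illustration, not something taken from a specific package. The length correction is only meant for full-length read counts such as SmartSeq2, since 3' UMI counts do not grow with transcript length.

```python
import numpy as np

def log_normalize(counts, scale=1e4):
    """log(x / n * scale + 1) per cell; counts is genes x cells.

    `scale` is an assumed convenience factor; set it to 1.0 to get
    exactly log(x/n + 1).
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=0, keepdims=True)   # total counts per cell (column)
    n[n == 0] = 1.0                          # leave empty cells at zero
    return np.log1p(counts / n * scale)

def length_corrected_log_normalize(counts, gene_lengths_bp, scale=1e4):
    """TPM-like variant: divide by gene length (kb) before per-cell scaling.

    `gene_lengths_bp` is a hypothetical vector with one length per gene (row).
    """
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float)[:, None] / 1e3
    rate = counts / lengths_kb               # reads per kilobase
    return log_normalize(rate, scale=scale)
```

With this sketch, the 10X matrix would go through `log_normalize` and the SmartSeq2 matrix through `length_corrected_log_normalize`. Whether that is enough to make the two comparable is exactly the open question here.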

Basically, what is the recommended way to normalize SmartSeq2 expression data?


rna-seq smartseq2 r • 439 views

I don't think it is advisable to compare low-throughput and high-throughput data. The gene coverage is going to be completely different, and at best you're going to lose information.

But if you'd like to do it anyway, I suppose you could reduce your SmartSeq2 gene coverage to match what you get for your 10x data and then re-scale it, though I believe that will create artificial patterns. I also don't believe it will make up for the difference in how amplification bias is handled, which 10x does pretty well thanks to UMIs, unlike SmartSeq2.

To go even further, I think most of the difference you're seeing comes from amplification bias.
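The gene-restriction idea above could be sketched roughly as follows in Python/NumPy. This is only an illustration under stated assumptions: the function name, the `min_cells` threshold, and the choice to define "detected" as a nonzero count in at least `min_cells` 10x cells are all hypothetical, and it does nothing about amplification bias.

```python
import numpy as np

def restrict_to_detected_genes(ss2_counts, ss2_genes, tenx_counts, tenx_genes,
                               min_cells=1):
    """Keep only genes present in both matrices and detected in the 10x data.

    Both matrices are genes x cells; `*_genes` are lists of row names.
    """
    ss2_counts = np.asarray(ss2_counts)
    tenx_counts = np.asarray(tenx_counts)

    # A gene counts as detected if it is nonzero in at least `min_cells` 10x cells.
    detected = (tenx_counts > 0).sum(axis=1) >= min_cells
    tenx_detected = {g for g, keep in zip(tenx_genes, detected) if keep}

    # Preserve the SmartSeq2 gene order for the shared set.
    shared = [g for g in ss2_genes if g in tenx_detected]
    ss2_index = {g: i for i, g in enumerate(ss2_genes)}
    tenx_index = {g: i for i, g in enumerate(tenx_genes)}
    ss2_rows = np.array([ss2_index[g] for g in shared], dtype=int)
    tenx_rows = np.array([tenx_index[g] for g in shared], dtype=int)

    return shared, ss2_counts[ss2_rows, :], tenx_counts[tenx_rows, :]
```

The "re-scale" step would then mean recomputing each cell's total count on the restricted matrices before applying log(x/n + 1), so that n reflects only the shared genes.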


I agree with yhdist. I've looked into this a lot, and my advice is simply not to combine them.

There are many differences (including different technical biases) between 10x and SmartSeq2, and, honestly speaking, I still don't know where a lot of the technical biases in each technology come from (I don't think anyone really knows). You could use a batch integration tool (like Harmony), but I'd recommend against it: you won't really gain any new information and will probably just end up distorting your biological signal. I'm sure you'd get a pretty good methods paper published if you really did discover an optimal way to use the two technologies together.

What I'd recommend: just look at them separately! You'll probably see the same cell types in both if they're from the same tissue. And with SmartSeq2, you may be able to use transcript-level information to further resolve your gene-level cell types.

