Recommended way to normalize a SmartSeq2 gene expression matrix to better match 10X expression data
8 months ago
Cookin ▴ 10


I have single-cell RNA sequencing data from similar tissue: one dataset collected with SmartSeq2 (full transcript length, no UMIs) and another collected with 10X (3' end, with UMIs).

I am doing a standard log(x/n + 1) normalization for the 10X data. However, I am unsure how to normalize the SmartSeq2 data. Should I correct for gene-length bias? When I apply log(x/n + 1) to the SmartSeq2 data, I get significant differences in gene expression between 10X and SmartSeq2.
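For concreteness, here is a minimal sketch of the two normalizations being weighed, written in Python with NumPy. It assumes x is a gene's count in a cell and n is that cell's total count, which the post does not spell out. The `scale` factor, the `gene_lengths_bp` vector, and both function names are illustrative assumptions; setting `scale=1` gives the literal log(x/n + 1). The length-corrected variant is a TPM-like transform, shown only as one way to address gene-length bias in full-length data, not as a recommendation.

```python
import numpy as np

def log_normalize(counts, scale=1e4):
    """Per-cell log(x / n * scale + 1) for a genes x cells matrix.

    `scale` is an assumed convenience factor; scale=1 gives log(x/n + 1).
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=0, keepdims=True)  # total counts per cell (column)
    n[n == 0] = 1.0                        # avoid dividing by zero for empty cells
    return np.log1p(counts / n * scale)

def length_corrected_log_normalize(counts, gene_lengths_bp, scale=1e4):
    """TPM-like variant: divide by gene length (in kb) before depth normalization.

    `gene_lengths_bp` is a hypothetical per-gene length vector, one entry per row.
    """
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float)[:, None] / 1e3
    rate = counts / lengths_kb             # reads per kilobase of gene
    return log_normalize(rate, scale)

# Toy example: 2 genes x 2 cells, gene 2 is three times longer than gene 1.
counts = np.array([[10, 0],
                   [30, 5]])
gene_lengths_bp = np.array([1000, 3000])

tenx_style = log_normalize(counts)
ss2_style = length_corrected_log_normalize(counts, gene_lengths_bp)
```

In the toy example, the two genes in the first cell have the same reads per kilobase, so the length-corrected values are equal even though the raw counts differ threefold. Length correction is generally not applied to 3' UMI data such as 10X, since those counts do not scale with transcript length in the same way.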

My goal is to integrate the 10X and Smart-Seq datasets and perform clustering. I'd like the two datasets to match as closely as possible before integration. I have a count matrix for each (rows are genes, columns are cells).

Basically, what is the recommended way to normalize SmartSeq2 expression data?


rna-seq smartseq2 r

I don't think it is advisable to compare low-throughput and high-throughput data. The gene coverage is going to be completely different, and at best you're going to lose information.

But if you'd like to do it anyway, I suppose you would reduce your gene coverage to match what you get for your 10x data and then re-scale it, though I believe that will create abnormal patterns. It also won't make up for the difference in how amplification bias is handled, which 10x does pretty well, unlike Smart-seq2.

To go even further, I think most of the difference you're seeing comes from amplification bias. (And you likely won't be able to pin down the source of the difference beyond that.)
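If you still want to try the "reduce gene coverage, then re-scale" idea, one possible reading of it is sketched below in Python with NumPy. This is an interpretation, not an established recipe: it keeps only the Smart-seq2 genes that are detected in the 10x matrix and then recomputes per-cell log normalization on the reduced matrix. The function names, the `min_cells` threshold, and the `scale` factor are all assumptions, and gene names are assumed to be unique within each dataset.

```python
import numpy as np

def subset_to_10x_detected(ss2_counts, ss2_genes, tenx_counts, tenx_genes, min_cells=1):
    """Keep Smart-seq2 rows whose gene is detected in at least `min_cells` 10x cells."""
    tenx_counts = np.asarray(tenx_counts)
    detected = {
        gene
        for gene, row in zip(tenx_genes, tenx_counts)
        if np.count_nonzero(row) >= min_cells
    }
    keep = np.array([i for i, g in enumerate(ss2_genes) if g in detected], dtype=int)
    kept_genes = [ss2_genes[i] for i in keep]
    return kept_genes, np.asarray(ss2_counts, dtype=float)[keep, :]

def rescale_per_cell(counts, scale=1e4):
    """Recompute log(x / n * scale + 1) using totals over the remaining genes only."""
    totals = counts.sum(axis=0, keepdims=True)
    totals = np.where(totals == 0, 1.0, totals)  # guard against cells left empty
    return np.log1p(counts / totals * scale)

# Toy matrices (rows are genes, columns are cells).
ss2_genes = ["A", "B", "C"]
ss2_counts = np.array([[5, 1],
                       [3, 0],
                       [2, 9]])
tenx_genes = ["B", "C", "D"]
tenx_counts = np.array([[0, 0],
                        [1, 0],
                        [4, 2]])

kept_genes, ss2_subset = subset_to_10x_detected(ss2_counts, ss2_genes, tenx_counts, tenx_genes)
ss2_rescaled = rescale_per_cell(ss2_subset)
```

Note that this only aligns the gene sets and per-cell depth. It does nothing about amplification bias, which is the point made above.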


I agree with yhdist. I've looked into this a lot, and my advice would simply be not to combine them, just as yhdist said.

There are many differences (including different technical biases) between 10x and smartseq -- and, honestly speaking, I still don't know where a lot of the technical biases present in each technology arise from (I don't think anyone really knows). You could use a batch integration tool (like Harmony) but I'd recommend against it -- you won't really gain any new info and will probably just end up butchering your biological signal. I'm sure you'd get a pretty good methods paper published if you really discover an optimal way to use both technologies with one another.

What I'd recommend: Just look at them separately! You'll probably see the same cell types in both if they're from the same tissue. And, maybe with smart-seq, you can use transcript-level information to further resolve your gene-level cell types.

