Which matrix should NMF use on single-cell RNA-seq data to find differential gene programs?
2
0
2.8 years ago
hq.huang11.6 ▴ 10

Hi there,

I'm new to scRNA-seq (I use the Seurat pipeline for analysis) and to NMF.

I'd like to run NMF on scRNA-seq data to find distinct gene programs (similar to marker genes for particular cells).

But I don't know which matrix I should use for NMF: normalized counts or scaled counts?

And how should I choose the factorization rank in NMF?

Does anyone have experience with this? Thanks for your help!

RNA-Seq • 2.7k views
4
2.8 years ago

You should use normalized counts. If by scaling you mean the standardization step (scaling to mean 0 and std. dev. 1) performed after normalization, then this would not be suitable for NMF: NMF requires a non-negative input matrix, and the standardized data would contain negative values.
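To make the point about negative values concrete, here is a minimal sketch in plain Python. It assumes a log-normalization of the form log1p(count / total * 10000), as in Seurat's default LogNormalize, followed by per-gene standardization; the toy matrix and the helper names `log_normalize` and `standardize` are made up for illustration.

```python
import math

def log_normalize(counts, scale=10_000):
    """Library-size normalization for one cell, then log1p (values stay >= 0)."""
    total = sum(counts)
    return [math.log1p(c / total * scale) for c in counts]

def standardize(values):
    """Center one gene to mean 0 and scale it to unit standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    sd = math.sqrt(var)
    return [(v - mean) / sd for v in values]

# Toy count matrix (hypothetical): rows are cells, columns are genes.
cells = [[0, 3, 10], [1, 0, 25], [0, 7, 2], [4, 1, 0]]

normalized = [log_normalize(c) for c in cells]

# Standardization is applied per gene, i.e. per column across cells.
genes = list(zip(*normalized))
scaled_genes = [standardize(list(g)) for g in genes]

min_normalized = min(min(row) for row in normalized)
min_scaled = min(min(g) for g in scaled_genes)

print("smallest normalized value:", min_normalized)  # non-negative
print("smallest scaled value:", min_scaled)          # negative
```

The normalized matrix is a valid NMF input, while the standardized one is not unless the negative values are altered first.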
Choosing the rank in NMF (and other factorization approaches) is an open problem. There are plenty of heuristics; many of them involve trying out various ranks and assessing which one is best for a given measure of quality. Sometimes one has an idea of the underlying data structure or data-generating process that can help make a choice.
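One of the simplest such heuristics is to scan a range of ranks and look at where the reconstruction error stops improving much. The sketch below assumes NumPy is available and uses a bare-bones NMF with Lee and Seung's multiplicative updates on synthetic data with a known rank of 3; the function names `nmf` and `relative_error` are illustrative, and a real analysis would use an established NMF implementation and probably a more robust quality measure (e.g. stability across random restarts).

```python
import numpy as np

def nmf(V, rank, n_iter=300, seed=0, eps=1e-9):
    """Minimal NMF (Frobenius loss) via multiplicative updates: V ~ W @ H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Updates keep W and H non-negative as long as V is non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def relative_error(V, W, H):
    """Frobenius norm of the residual, relative to the norm of V."""
    return np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Synthetic non-negative data with true rank 3 plus a little noise.
rng = np.random.default_rng(1)
true_W = rng.random((60, 3))
true_H = rng.random((3, 40))
V = true_W @ true_H + 0.01 * rng.random((60, 40))

# Fit one model per candidate rank and record its reconstruction error.
errors = {k: relative_error(V, *nmf(V, k)) for k in range(1, 7)}
for k, err in errors.items():
    print(f"rank {k}: relative error {err:.4f}")
```

On real scRNA-seq data the curve rarely has such a clean elbow, which is part of why rank selection remains a judgment call.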

0

Thanks for your advice! By "scaled" I do mean the standardization step. On the first question, I tried using the scaled counts with the negative values set to 0, and it seems to give the result we want. I'm not sure, but maybe both normalized counts and scaled counts can be used as input to NMF. Or has the information in the scaled counts been altered by setting the negative values to 0? If you have more comments, I'd welcome them.

0

If there are only a few negative values, you could consider them outliers and set them to 0. However, for standardized data, one would expect a significant number of values to be negative and to be meaningful. Discarding them or setting them to 0 means you're treating values below the mean (i.e. below 0 after standardization) as bad or wrong, and by setting them to 0 you're artificially raising those expression levels to the mean. If doing so produces meaningful results, it suggests there's a lot of noise in the low expression levels. In this case, putting a threshold on the normalized counts should have the same effect.
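As a rough illustration of that last point, the sketch below (assuming NumPy, with simulated Poisson counts standing in for real data) checks that clipping standardized values at 0 is the same as thresholding each gene's normalized values at its own mean, then rescaling by the standard deviation. It also reports how many standardized values are negative, to show that these are not a handful of outliers.

```python
import numpy as np

# Simulated log-normalized expression (hypothetical data): 200 cells x 5 genes.
rng = np.random.default_rng(0)
x = np.log1p(rng.poisson(2.0, size=(200, 5)).astype(float))

# Per-gene standardization (mean 0, standard deviation 1).
mean = x.mean(axis=0)
sd = x.std(axis=0, ddof=1)
z = (x - mean) / sd

# Option A: set negative standardized values to 0.
clipped = np.clip(z, 0, None)

# Option B: threshold the normalized values at the gene mean, then rescale.
thresholded = np.where(x > mean, x - mean, 0.0) / sd

frac_negative = (z < 0).mean()
print("fraction of negative standardized values:", frac_negative)
print("clipping equals mean-thresholding:", np.allclose(clipped, thresholded))
```

Seen this way, zeroing the negatives amounts to a per-gene threshold at the mean, which can be chosen more deliberately on the normalized counts themselves.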

3
17 months ago
zdebruine ▴ 60

I agree with everything that Jean-Karim has said, but disagree about using normalized counts. I always use raw counts, because NMF naturally scales the data during model fitting. If you normalize, you are confounding a lot of relative signals and changing the nature of the question. See here for a rundown of how well NMF works on raw counts: https://www.biorxiv.org/content/10.1101/2021.09.01.458620v1.