Question: Which matrix should NMF use in single-cell RNA-seq data to find differential gene programs?
5 months ago by
China/Xiamen/Xiamen university
hq.huang11.60 wrote:

Hi there,

I'm new to scRNA-seq (I use the Seurat pipeline for the analysis) and to NMF.

I'm planning to run NMF on scRNA-seq data to find the different gene programs (like markers for some cells).

But I don't know which matrix I should use for NMF: normalized counts or scaled counts?

And how should I choose the factorization rank in NMF?

Does anyone have experience with this? Thanks for your help!

5 months ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche23k wrote:

You should use normalized counts. If by scaling you mean the standardization step (scaling to mean 0 and std. dev. 1) performed after normalization, then this would not be suitable for NMF, because NMF requires a non-negative input matrix and the standardized data would contain negative values.
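To make the point about negative values concrete, here is a minimal sketch on purely synthetic counts. The log-normalization with a scale factor of 10,000 is assumed to mimic Seurat's default; the matrix sizes and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy count matrix: 50 genes (rows) x 200 cells (columns); purely synthetic.
counts = rng.poisson(lam=2.0, size=(50, 200)).astype(float)

# Library-size normalization followed by log1p
# (assumed to resemble Seurat's default log-normalization).
lib_size = counts.sum(axis=0, keepdims=True)
normalized = np.log1p(counts / lib_size * 1e4)

# Per-gene standardization: mean 0 and std. dev. 1 across cells.
mean = normalized.mean(axis=1, keepdims=True)
std = normalized.std(axis=1, keepdims=True)
std[std == 0] = 1.0  # guard against constant genes
scaled = (normalized - mean) / std

print("min of normalized:", normalized.min())
print("min of scaled:", scaled.min())
print("fraction of negative entries in scaled:", (scaled < 0).mean())
```

The normalized matrix stays non-negative, while a large share of the standardized entries falls below zero, which is why only the former can be passed to NMF as is.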
Choosing the rank in NMF (and other factorization approaches) is an open problem. There are plenty of heuristics, many of which involve trying out various ranks and assessing which one is best for a given measure of quality. Sometimes one has an idea of the underlying data structure or data generation process that can help make a choice.
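As an illustration of the "try several ranks and compare" heuristic, here is a rough sketch. It uses a basic multiplicative-update NMF written in NumPy (not the implementation of any particular package), synthetic data with a known rank of 3, and relative reconstruction error as the quality measure. All of these are assumptions chosen for the example; other measures (e.g. stability across random restarts) are equally valid choices.

```python
import numpy as np

def nmf_mu(V, k, n_iter=300, seed=0, eps=1e-10):
    """Basic NMF via multiplicative updates (Frobenius loss)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic non-negative data built from 3 hidden "programs" plus a little noise.
rng = np.random.default_rng(1)
true_rank = 3
V = rng.random((40, true_rank)) @ rng.random((true_rank, 100))
V += 0.01 * rng.random(V.shape)

# Scan candidate ranks and record the relative reconstruction error.
errors = {}
for k in range(1, 7):
    W, H = nmf_mu(V, k)
    errors[k] = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
    print(f"rank {k}: relative reconstruction error = {errors[k]:.4f}")
```

One would typically look for the rank beyond which the error stops improving noticeably. On real scRNA-seq data the curve is rarely this clean, so this is a starting point rather than a decision rule.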

Thanks for your advice! By "scaled" I do mean the standardization step. Regarding the first question, I tried using the scaled counts with the negative values set to 0, and that seems to give the result we want. I'm not sure, but maybe both normalized counts and scaled counts can be used as input for NMF. Or has the information in the scaled counts been changed by setting the negative values to 0? If you have more comments, I'd be glad to hear them.

If there are only a few negative values, you could consider them outliers and set them to 0. However, for standardized data, one would expect a significant number of values to be negative and to be meaningful. Discarding them or setting them to 0 means you're treating values below the mean (i.e. below 0 after standardization) as bad/wrong, and by setting them to 0 you're artificially increasing the expression level, since every below-average value is raised to the mean. If doing so produces meaningful results, it suggests there's a lot of noise in the low expression levels. In this case, putting a threshold on the normalized counts should have the same effect.
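A small sketch of that last point, on synthetic data: clipping the standardized matrix at 0 keeps exactly the entries that lie above their gene's mean, so thresholding the normalized counts at the per-gene mean retains the same entries. The choice of the per-gene mean as the threshold is an assumption made here for illustration; the reply does not specify one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic counts: 50 genes (rows) x 200 cells (columns).
counts = rng.poisson(lam=2.0, size=(50, 200)).astype(float)
normalized = np.log1p(counts / counts.sum(axis=0, keepdims=True) * 1e4)

gene_mean = normalized.mean(axis=1, keepdims=True)
gene_std = normalized.std(axis=1, keepdims=True)
scaled = (normalized - gene_mean) / gene_std

# Option A: clip the standardized data at 0.
clipped = np.maximum(scaled, 0.0)

# Option B (assumed threshold): zero out normalized values
# that are at or below the per-gene mean.
thresholded = np.where(normalized > gene_mean, normalized, 0.0)

same_support = np.array_equal(clipped > 0, thresholded > 0)
print("fraction of entries zeroed by clipping:", (scaled <= 0).mean())
print("same non-zero pattern in A and B:", same_support)
```

The two matrices differ in the values they keep (standardized versus normalized units), but they discard the same entries, and Option B avoids rescaling every gene to unit variance before factorization.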