Question

Why NMF for mutation signature analysis?

3

5.5 years ago

CY ▴ 750

Every mutation signature analysis I have seen was done using NMF. I am a bit new to this. Why do people always choose NMF for such analyses? Are there alternatives?

I also found that NMF is usually used for mutation signatures and SVD for expression signatures. Is there a biological reason behind this?

mutation NMF signature • 4.4k views

score 6 · Answer 1 · 2018-11-14

5.5 years ago

Min Dai ▴ 160

A mutational profile is naturally nonnegative, which matches the nonnegativity constraint of NMF. You can regard each of the k latent components as a combination of genes (i.e. a metagene).

NMF can help you see which "parts" (groups of genes) are active in which class of patients. In the classic face recognition example, NMF identifies intuitive parts of faces, like mouths, eyes and noses.
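To make the "parts" idea concrete, here is a minimal sketch of NMF with the standard Lee-Seung multiplicative updates, written with plain Python lists so it has no dependencies. The toy matrix, the two hypothetical signatures, and all function names are assumptions made up for illustration; for real data you would more likely use an existing package.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Frobenius norm of V - W @ H."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for vr, rr in zip(V, WH) for v, r in zip(vr, rr))

def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate V by W @ H with W, H >= 0 (multiplicative updates)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random start, so every later update stays nonnegative.
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [frob_error(V, W, H)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
        errors.append(frob_error(V, W, H))
    return W, H, errors

# Hypothetical toy data: 4 features (rows) x 5 samples (columns),
# generated from two made-up nonnegative "signatures".
true_signatures = [[8, 0], [2, 0], [0, 3], [0, 7]]    # features x 2
true_exposures = [[1, 2, 0, 3, 1], [0, 1, 4, 1, 2]]   # 2 x samples
V = matmul(true_signatures, true_exposures)

W, H, errors = nmf(V, k=2)
print("reconstruction error: start", errors[0], "end", errors[-1])
```

Because the updates only multiply and divide nonnegative numbers, W (the parts) and H (how much of each part every sample carries) can never become negative, which is why the components stay interpretable as additive parts.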

Further, you can conveniently add a regularization term to standard NMF in order to integrate useful information (e.g. a PPI network or known relationships between patients) into the factorization process.
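As a sketch of what such a regularized objective can look like, one common form is graph-regularized NMF. All symbols here are assumptions for illustration: V is the genes x patients matrix, W (genes x k) holds the metagenes, H (k x patients) holds the patient loadings, A is the symmetric adjacency matrix of a gene network such as a PPI network, D is its diagonal degree matrix, L = D - A is the graph Laplacian, and lambda >= 0 sets the strength of the penalty. This is not necessarily the formulation used by any particular package.

```latex
\min_{W \ge 0,\; H \ge 0} \;
  \lVert V - WH \rVert_F^2 \;+\; \lambda \, \operatorname{tr}\!\left( W^{\top} L W \right),
\qquad
\operatorname{tr}\!\left( W^{\top} L W \right)
  = \frac{1}{2} \sum_{i,j} A_{ij} \, \lVert w_i - w_j \rVert_2^2
```

Here w_i is row i of W, so the penalty is small when genes that are connected in the network get similar weights across the k metagenes.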

Finally, you can try R packages such as NMF or NNLM.

0

I have indeed read about some methods that use LASSO to enhance sparsity, although I am not sure about the biology behind the sparsity assumption. Besides, I know that SVD is usually used for gene expression signatures. Why NMF for mutation signatures and SVD for expression signatures? Is there a biological reason for this?

ADD REPLY • link 5.5 years ago by CY ▴ 750

0

My thinking is: sparsity helps you interpret the biological meaning of the metagenes, because only a few coefficients are nonzero, which makes it easier to understand the function of that group of genes. You can think of the expression profile or mutation profile as the output of some intrinsic biological processes. Maybe one metagene corresponds to one or two pathways, or to a subnetwork in a PPI network or gene regulation network.

Besides LASSO, there are other ways to impose sparsity, e.g. an L0-norm penalty, and sparse versions of PCA and SVD also exist.
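To illustrate why an L1 (LASSO-style) penalty yields exact zeros rather than merely small values, here is a minimal sketch of nonnegative soft-thresholding, the shrink-and-clip step that L1 penalties typically reduce to inside an optimizer. The coefficient values and the penalty `lam` are hypothetical, and the function name is made up for this example.

```python
def soft_threshold_nonneg(x, lam):
    """Proximal step for the penalty lam * x with x >= 0: shrink by lam, clip at 0."""
    return max(x - lam, 0.0)

# Hypothetical metagene coefficients before the sparsity step.
weights = [0.05, 0.90, 0.02, 0.40, 0.00, 0.08]
lam = 0.1

sparse = [soft_threshold_nonneg(w, lam) for w in weights]
n_zero = sum(1 for w in sparse if w == 0.0)
print(sparse, "zeros:", n_zero)
```

Every coefficient at or below `lam` is set exactly to zero, while the larger ones are only shrunk, so just a few genes remain in each metagene.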

I think NMF is not limited to mutation signatures; it can certainly work well for gene expression analysis too. For example, a classic paper introduced NMF for analyzing gene expression matrices: Metagenes and molecular pattern discovery using matrix factorization (https://www.ncbi.nlm.nih.gov/pubmed/15016911). There is also a more recently published paper: Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations (https://www.ncbi.nlm.nih.gov/pubmed/29987051).

In addition, the latent components generated by NMF are not required to be orthogonal, unlike those of PCA and SVD. ICA (independent component analysis) is, to some extent, similar to NMF and is worth a look.

ADD REPLY • link 5.5 years ago by Min Dai ▴ 160

0

Thanks for explaining. I did some digging. Some methods justify the sparse solution by noting that most mutagens are highly specific in the type of damage they cause, and therefore the majority of somatic mutational signatures are sparse.

ADD REPLY • link 5.5 years ago by CY ▴ 750

0

Thanks. It makes sense now.

ADD REPLY • link 5.5 years ago by Min Dai ▴ 160

score 2 · Answer 2 · 2018-11-14

5.5 years ago

Dawe ▴ 270

Basically any blind source separation method should work. People probably use NMF because it is simple and effective. BTW, you can find criticism and different flavors here:

https://www.biorxiv.org/content/early/2018/08/04/384834

d

0

Thx. I read the article. Why do such methods emphasize sparsity (even using LASSO to enhance it)? What is the biology behind this assumption?

ADD REPLY • link 5.5 years ago by CY ▴ 750

score 2 · Answer 3 · 2018-11-14

Sort of complementing @Minstein's answer, there is a nice visual comparison of NMF, PCA and k-means clustering in figure 14.33 (page 555 and the paragraphs around it) of The Elements of Statistical Learning (the PDF is freely available). You should be able to carry the message over to bioinformatics questions.