Closed: Introducing M3C: Monte Carlo reference-based consensus clustering


4.4 years ago

chris86 ▴ 400

Hello. I would like to introduce M3C (Monte Carlo reference-based consensus clustering), a cluster analysis tool we developed at UCL and QMUL over the last 4 years for choosing the number of clusters in consensus clustering. It is available from Bioconductor: https://bioconductor.org/packages/release/bioc/html/M3C.html

Consensus clustering (the Monti variant) resamples and clusters the data for each K (the number of clusters). For each K it builds an N x N consensus matrix, where N is the number of samples and each element is the fraction of times two samples clustered together. A perfectly stable matrix would consist entirely of 0s and 1s, meaning every pair of samples either always or never clustered together across resampling iterations. The next step is to compare the stability of these consensus matrices across K in order to select K.
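To make the consensus matrix concrete, here is a minimal base R sketch for a single K. It is not the M3C implementation: the function name `consensus_matrix`, the use of k-means as the inner algorithm, the 80% subsampling fraction, and the samples-in-rows layout are all assumptions made for illustration.

```r
# Sketch of a Monti-style consensus matrix for one value of K.
# Assumptions (not from the post): k-means as the inner algorithm,
# 80% subsampling of samples, samples stored in rows.
consensus_matrix <- function(x, K, reps = 100, frac = 0.8) {
  n <- nrow(x)
  together <- matrix(0, n, n)  # times a pair fell in the same cluster
  sampled  <- matrix(0, n, n)  # times a pair was drawn in the same subsample
  for (r in seq_len(reps)) {
    idx  <- sample(n, size = floor(frac * n))  # subsample without replacement
    cl   <- kmeans(x[idx, , drop = FALSE], centers = K, nstart = 5)$cluster
    same <- outer(cl, cl, "==")                # TRUE where a pair co-clusters
    together[idx, idx] <- together[idx, idx] + same
    sampled[idx, idx]  <- sampled[idx, idx] + 1
  }
  together / pmax(sampled, 1)  # avoid 0/0 for pairs never drawn together
}

set.seed(1)
# Toy data: two well separated groups of 20 samples, 5 features each
x <- rbind(matrix(rnorm(100, mean = 0), 20, 5),
           matrix(rnorm(100, mean = 6), 20, 5))
cm <- consensus_matrix(x, K = 2, reps = 50)
```

On data like this, entries within a group should sit at or near 1 and entries between groups at or near 0, which is the "perfectly stable" pattern described above.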

The problem we noticed is that this type of consensus clustering algorithm produces consensus matrices that appear more stable as K increases, by chance alone. Unless this is taken into account, commonly used metrics such as the PAC score (proportion of ambiguous clustering) are subject to substantial bias. So we use a Monte Carlo simulation to generate null distributions of stability scores, which are used to correct for the chance expectation and to test the null hypothesis K=1. This method improves the sensitivity to detect real structure in noisy Gaussian datasets, such as those from the TCGA.
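As a rough illustration of the correction step for one K, the sketch below takes a stability score from the real data plus scores from null simulations and returns a corrected index and an empirical p value. The function name `reference_correction` is hypothetical, and both formulas are assumptions on my part (a log10 ratio of the reference mean to the real score, and a Monte Carlo p value with the usual +1 correction); check the paper for the exact definitions M3C uses.

```r
# Sketch of the reference correction for one K, given stability scores
# where lower means more stable (PAC or entropy).
# Assumed formulas: RCSI = log10(mean(reference)) - log10(real), and an
# empirical Monte Carlo p value with the usual +1 correction.
reference_correction <- function(real_score, ref_scores) {
  B    <- length(ref_scores)
  rcsi <- log10(mean(ref_scores)) - log10(real_score)
  # Count null simulations that were at least as stable as the real data
  p <- (1 + sum(ref_scores <= real_score)) / (B + 1)
  list(rcsi = rcsi, p = p)
}

# Illustrative numbers only, not results from any dataset
ref_scores <- c(0.42, 0.38, 0.45, 0.40, 0.35)
out <- reference_correction(real_score = 0.10, ref_scores = ref_scores)
out$rcsi  # positive: the real data are more stable than the reference mean
out$p     # 1/6, the smallest p value attainable with 5 simulations
```

The last line also shows why the number of Monte Carlo iterations limits p value resolution: with B simulations the smallest empirical p value is 1/(B + 1).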

We also provide an entropy objective function in the algorithm (binary information entropy). It works directly on the consensus matrix probabilities, rather than first calculating a CDF and then deriving scores from it, which we think is a more mathematically elegant and appropriate approach. The aim is then to find the consensus matrix with the minimal uncertainty, or entropy (that is, the greatest stability during resampling). We have also included the PAC score as a second objective function in the package.
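For anyone who wants to see the two objective functions side by side, here is a small base R sketch. The function names are hypothetical, and a few details are assumptions rather than what the package necessarily does: entropy is summed over the off-diagonal pairs without normalisation, and PAC uses 0.1 and 0.9 as cut-offs.

```r
# Binary information entropy of a consensus matrix, using the off-diagonal
# pairs only. Entries of exactly 0 or 1 contribute zero entropy.
consensus_entropy <- function(cm) {
  p <- cm[upper.tri(cm)]
  h <- -(p * log2(p) + (1 - p) * log2(1 - p))
  h[is.nan(h)] <- 0  # 0 * log2(0) is NaN in R; its limit is 0
  sum(h)
}

# PAC: proportion of pairs with an ambiguous consensus value.
# The 0.1 and 0.9 cut-offs are assumed here.
pac_score <- function(cm, lo = 0.1, hi = 0.9) {
  p <- cm[upper.tri(cm)]
  mean(p > lo & p < hi)
}

# A perfectly stable 3 x 3 matrix and a maximally ambiguous one
stable <- matrix(c(1, 1, 0,
                   1, 1, 0,
                   0, 0, 1), 3, 3)
noisy <- matrix(0.5, 3, 3)
diag(noisy) <- 1

consensus_entropy(stable)  # 0
consensus_entropy(noisy)   # 3, one bit for each of the three pairs
pac_score(stable)          # 0
pac_score(noisy)           # 1
```

Both scores are lower for more stable matrices, so either one can be minimised over K before any correction for chance.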

Usage:

library(M3C)
# mydata: numeric matrix or data frame (features in rows, samples in columns)
res <- M3C(mydata, iters = 25)

The key output is the entropy corrected for the chance expectation, the Relative Cluster Stability Index (RCSI). We look for its maximum value instead of elbows or other subjective approaches, which is a lot easier and removes the internal bias of the algorithm. M3C also calculates p values for each K.

We suggest examining entropy (or PAC score), the RCSI, and p values when deciding K. Some other recommendations on parameter settings and good practice are:

The default number of Monte Carlo iterations is 25, as a compromise between speed and reliability. This could be increased to 100 for more accurate p values and smaller confidence intervals for the RCSI.
The default number of inner replications used to resample the data is 100. Results are generally stable with this number, but increasing it to, say, 250 will lead to better stability on some datasets.
Preferably, follow up the analysis by applying SigClust to every pair of clusters to help confirm the structure.
Use dimensionality reduction methods such as PCA and t-SNE to confirm structure and make sure there are no problematic results (batch effects, outliers, etc.).
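For the last recommendation, a quick PCA check can be done in base R without any extra packages. This is only a sketch on simulated data: `mydata` here is a toy matrix in a features-in-rows, samples-in-columns layout, and `clusters` is a stand-in for whatever cluster assignments you obtained at the chosen K.

```r
set.seed(1)
# Toy data: 50 features, two groups of 15 samples (samples in columns)
mydata <- cbind(matrix(rnorm(50 * 15, mean = 0), 50, 15),
                matrix(rnorm(50 * 15, mean = 3), 50, 15))
clusters <- rep(c(1, 2), each = 15)  # stand-in for the assignments at the chosen K

# prcomp expects samples in rows, so transpose first
pca <- prcomp(t(mydata), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Colour samples by cluster; look for separation, outliers and batch effects
plot(pca$x[, 1], pca$x[, 2], col = clusters, pch = 19,
     xlab = "PC1", ylab = "PC2")
```

If the clusters are real, they should usually separate along one of the leading components. Samples that sit far from everything else, or a split that lines up with a known batch variable, are worth investigating before trusting the chosen K.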

Another single-view cluster analysis tool we have had good results with is CLEST, implemented in the RSKC package. It could be used as a supporting method.

John, Christopher R., et al. "M3C: Monte Carlo reference-based consensus clustering." bioRxiv (2019): 377002.

https://github.com/crj32/M3C

RNA-Seq next-gen R sequencing Tool • 391 views

4.2 years ago by chris86 ▴ 400