Question: how to choose optimal K in Consensus clustering
3
LJ180 wrote:

Hi all, I am trying choose best k from the consensus clustering using the "ConsensusClusterPlus" R package.And the result images are ：

k=2,heatmaps of the consensus matrices

k=3,heatmaps of the consensus matrices

k=4,heatmaps of the consensus matrices

k=5,heatmaps of the consensus matrices

CDF plot

tracking plot

Delta Area Plot

So which k should i choose? 2 or 3? And it is difficult to find the true k only with the CDF plot,So apart from the method of checking out the heatmaps of the consensus matrices and the CDF plot, what are some other reliable ways we can choose for finding the optimal k in consensus clustering?

consensus clustering R cluster • 6.6k views
modified 3.0 years ago by Chih-Hsu Jack Lin60 • written 3.4 years ago by LJ180

I am seeing the images when I click the links and did not need to sign up.

This is a question that comes up a lot. Could you re-post the images?

6
Chih-Hsu Jack Lin60 wrote:

PAC has been shown to outperform other K-estimating methods (e.g., ) in this paper and this paper

Dr. Yasin Şenbabaoğlu has kindly provided the R implementation of PAC . You can use the results in ConsensusClusterPlus as input to get optimal K based on minimum PAC. The code is from here.

``````########################################################
seed=11111
d = matrix(rnorm(200000,0,1),ncol=200) # 200 samples in columns, 1000 genes in rows
colnames(d) = paste("Samp",1:200,sep="")
rownames(d) = paste("Gene",1:1000,sep="")
d = sweep(d,1, apply(d,1,median,na.rm=T))
maxK = 6 # maximum number of clusters to try
results = ConsensusClusterPlus(d,maxK=maxK,reps=50,pItem=0.8,pFeature=1,title="test_run",

# Note that we implement consensus clustering with innerLinkage="complete".
# We advise against using innerLinkage="average" which is the default value in this package as average linkage is not robust to outliers.

############## PAC implementation ##############
Kvec = 2:maxK
x1 = 0.1; x2 = 0.9 # threshold defining the intermediate sub-interval
PAC = rep(NA,length(Kvec))
names(PAC) = paste("K=",Kvec,sep="") # from 2 to maxK
for(i in Kvec){
M = results[[i]]\$consensusMatrix
Fn = ecdf(M[lower.tri(M)])
PAC[i-1] = Fn(x2) - Fn(x1)
}#end for i
# The optimal K
optK = Kvec[which.min(PAC)]
########################################################
``````
1

Are you aware of any studies that have validated the PAC as the most accurate K-estimating measure?

I've seen countless studies proposing a new method for X which appear to show that the new method outperforms the old method, only for that study to be refuted in the future.

Unless the PAC has been validated in newer studies, I would hesitate to replace tried and true measures (such as the alternatives mentioned in the PAC paper) with the PAC method.

Additionally -- as a comment below states , no matter what method you use for choosing K, you must determine whether this K makes sense for the question you are trying to answer and the context of the study. For instance, If I am trying to determine molecular subtypes of diseases X, and the PAC method tells me that there are 10 subtypes, but we can find no clear defining features between subtypes 9 + 10, then there is really no point in saying there are 10 subtypes, because we can only support 9. Hope that makes sense.

EDIT: forgot to mention that in my own personal experience the PAC tends to overestimate the number of clusters -- which is why I am skeptical of the PAC until it is proven in other studies.

2

The PAC method overestimates K and declares false positives (ref below). The authors below propose using the RCSI, which is a derivative of the PAC score, but uses a Monte Carlo reference procedure like the GAP-statistic does, this prevents overestimation. My advice, try a few methods and see what works best on your data and what seems to make sense to you. We have made a Gaussian cluster simulator to test clustering algorithms in R here (https://cran.r-project.org/web/packages/clusterlab/index.html).

The problem with the delta k method in the Monti consensus clustering algorithm is locating the optimal K visually by looking at the delta K plot is subjective. Sometimes people look for elbows or the last value before the floor, it hasn't been well defined what to look for. It is less subjective if we are looking for a maxima or minima as in the M3C algorithm. The delta K method also does not work for higher values of K because the CDF converges. To see this, simply simulate increasing numbers of clusters and test the Monti algorithm on them and you will see the problem.

Can you tell me how are you getting the output im running the consensusclusterPlus library but i cannot get the output figure..

`````` seed=11111
d = matrix(rnorm(200000,0,1),ncol=200) # 200 samples in columns, 1000 genes in rows
colnames(d) = paste("Samp",1:200,sep="")
rownames(d) = paste("Gene",1:1000,sep="")
d = sweep(d,1, apply(d,1,median,na.rm=T))
maxK = 6 # maximum number of clusters to try
results = ConsensusClusterPlus(d,maxK=maxK,reps=50,pItem=0.8,pFeature=1,title="test_run",
``````

where is the output going into? can you tell me

Sorry, finally did you get how to obtain image from consensusclusterPlus ?? I completed the analysis but no plot produced :(

Hi, Chih-Hsu Jack Lin, I didn't see the result of estimating k after I ran the code in the part of "PAC implementation" above. Could you explain to me what's wrong ? Thanks in advance.

3
Jean-Karim Heriche21k wrote:

Usually there is no true k in data, it often depends on what you're looking for. As an example, consider clustering proteins into complexes. Different people would have different view of the granularity of many complexes, e.g. should the proteasome be one complex or is it composed of the 20S core and the19S regulatory complex or maybe it should be subdivided into lid, base and alpha and beta rings ?

EDIT: After looking at your results, it seems you are in the case I described with the proteasome example. There seems to be structure in your data but which granularity you choose should depend on what the question you're trying to answer is. My choice from the heatmap images would be k=3 but if you're ready to accept k=2, you should also consider k=4. One of the consideration is whether you want to have small clusters.

the best way to infer best k,we can (1) not relying solely on the consensus matrix heatmap to declare the existence of clusters;(2)apply the proportion of ambiguous clustering (PAC),just the CDF plot .And consensus clustering+PAC has been shown to perform better than several other commonly used methods.We can therefore infer the optimal number of clusters by the K value having the lowest PAC.

And another question,As you can see in here,the fraction of samples with consensus indices fall in [0,1],but in my CDF plot,the consensus matrix CDFs all start from 0.3 or higher.So is there any problem or something else in my data?

Thanks.

1

Without seeing your data and what you've done (e.g. what clustering algorithms you use with which distance measure...), it's hard to go any further.
As I tried to explain, relying on this kind of analysis alone ignores the context of the data. You need to consider whether the clusters are meaningful/relevant to the question you're interested in. For example, does the very small cluster you get for k>=4 make any sense in the context of your question ? Maybe this identifies outliers that should be removed from the data.

So,we do not need to pay special attention to the cluster numbers,as long as the results meet my requirement,and the clusters are meaningful? Thanks a million.

1
Erik Wright360 wrote:

You could try using gap statistics: http://www.web.stanford.edu/~hastie/Papers/gap.pdf

There are several R functions available for this purpose.

The Gap-statistic does not perform well when p >> n.

you mean genes greater than samples?

It doesn't work well on high dimensional genomic data I have found.

It tends to just increase with K so choosing K is very hard in practice.

chris, out of curiosity, have you ever actually implemented the gap statistic in any analysis? I have used it repeatedly over the years but am usually disappointed in how it performs. I even produced a parallelised version of it, though: R functions edited for parallel processing

1

Yeah I tried running it on a bunch of cancer RNA-seq datasets, and basically it just increases as K increases. Which is a problem and it is hard to find the real K.

Finding K on the data we work on is actually quite a challenging problem, many methods fail in practice. GAP-statistic works well under certain conditions, but with high dimensional Gaussian data-sets it does not.

So far I have found M3C works nicely with a single platform (not single cell RNA-seq though) and is my preferred option. I also found CLEST provides similar results on this type of data and is quite nice. We are also working on some other stuff that is in the pipeline atm.