Question: how to choose optimal K in Consensus clustering
3
gravatar for LJ
2.4 years ago by
LJ150
LJ150 wrote:

Hi all, I am trying choose best k from the consensus clustering using the "ConsensusClusterPlus" R package.And the result images are :

k=2,heatmaps of the consensus matrices

k=3,heatmaps of the consensus matrices

k=4,heatmaps of the consensus matrices

k=5,heatmaps of the consensus matrices

CDF plot

tracking plot

Delta Area Plot

So which k should i choose? 2 or 3? And it is difficult to find the true k only with the CDF plot,So apart from the method of checking out the heatmaps of the consensus matrices and the CDF plot, what are some other reliable ways we can choose for finding the optimal k in consensus clustering?

Thanks in advance.

consensus clustering R cluster • 4.4k views
ADD COMMENTlink modified 2.1 years ago by Chih-Hsu Jack Lin60 • written 2.4 years ago by LJ150

Note: Access to the result images in the link above requires signing up to the service.

ADD REPLYlink written 2.4 years ago by Jean-Karim Heriche17k

I am seeing the images when I click the links and did not need to sign up.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by genomax58k

sorry, i made a mistake in uploading images, you should now have access to the images.

ADD REPLYlink written 2.4 years ago by LJ150

This is a question that comes up a lot. Could you re-post the images?

ADD REPLYlink written 2.0 years ago by SplitInf0
6
gravatar for Chih-Hsu Jack Lin
2.1 years ago by
Baylor College of Medicine
Chih-Hsu Jack Lin60 wrote:

PAC has been shown to outperform other K-estimating methods (e.g., ) in this paper and this paper

Dr. Yasin Şenbabaoğlu has kindly provided the R implementation of PAC . You can use the results in ConsensusClusterPlus as input to get optimal K based on minimum PAC. The code is from here.

######################################################## 
seed=11111
d = matrix(rnorm(200000,0,1),ncol=200) # 200 samples in columns, 1000 genes in rows
colnames(d) = paste("Samp",1:200,sep="")
rownames(d) = paste("Gene",1:1000,sep="")
d = sweep(d,1, apply(d,1,median,na.rm=T))
maxK = 6 # maximum number of clusters to try
results = ConsensusClusterPlus(d,maxK=maxK,reps=50,pItem=0.8,pFeature=1,title="test_run",
innerLinkage="complete",seed=seed,plot="pdf")

# Note that we implement consensus clustering with innerLinkage="complete". 
# We advise against using innerLinkage="average" which is the default value in this package as average linkage is not robust to outliers.

############## PAC implementation ##############
Kvec = 2:maxK
x1 = 0.1; x2 = 0.9 # threshold defining the intermediate sub-interval
PAC = rep(NA,length(Kvec)) 
names(PAC) = paste("K=",Kvec,sep="") # from 2 to maxK
for(i in Kvec){
  M = results[[i]]$consensusMatrix
  Fn = ecdf(M[lower.tri(M)])
  PAC[i-1] = Fn(x2) - Fn(x1)
}#end for i
# The optimal K
optK = Kvec[which.min(PAC)]
########################################################
ADD COMMENTlink written 2.1 years ago by Chih-Hsu Jack Lin60

Can you tell me how are you getting the output im running the consensusclusterPlus library but i cannot get the output figure..

 seed=11111
    d = matrix(rnorm(200000,0,1),ncol=200) # 200 samples in columns, 1000 genes in rows
    colnames(d) = paste("Samp",1:200,sep="")
    rownames(d) = paste("Gene",1:1000,sep="")
    d = sweep(d,1, apply(d,1,median,na.rm=T))
    maxK = 6 # maximum number of clusters to try
    results = ConsensusClusterPlus(d,maxK=maxK,reps=50,pItem=0.8,pFeature=1,title="test_run",
    innerLinkage="complete",seed=seed,plot="pdf")

where is the output going into? can you tell me

ADD REPLYlink written 11 months ago by krushnach80420

Hi, Chih-Hsu Jack Lin, I didn't see the result of estimating k after I ran the code in the part of "PAC implementation" above. Could you explain to me what's wrong ? Thanks in advance.

ADD REPLYlink written 10 months ago by gjliu03250

Are you aware of any studies that have validated the PAC as the most accurate K-estimating measure?

I've seen countless studies proposing a new method for X which appear to show that the new method outperforms the old method, only for that study to be refuted in the future.

Unless the PAC has been validated in newer studies, I would hesitate to replace tried and true measures (such as the alternatives mentioned in the PAC paper) with the PAC method.

Additionally -- as a comment below states , no matter what method you use for choosing K, you must determine whether this K makes sense for the question you are trying to answer and the context of the study. For instance, If I am trying to determine molecular subtypes of diseases X, and the PAC method tells me that there are 10 subtypes, but we can find no clear defining features between subtypes 9 + 10, then there is really no point in saying there are 10 subtypes, because we can only support 9. Hope that makes sense.

EDIT: forgot to mention that in my own personal experience the PAC tends to overestimate the number of clusters -- which is why I am skeptical of the PAC until it is proven in other studies.

ADD REPLYlink modified 6 months ago • written 6 months ago by reilly.brian.m50
1

The PAC method over-fits and declares false positives (ref below). The authors below propose using the RCSI, which is a derivative of the PAC score, but uses a Monte Carlo reference procedure like the GAP-statistic does. My advice, try a few methods and see what works best on your data.

In my experience I have found the RCSI rather effective. The PAC is however, somewhat better than relying on the 'tried and tested' delta K method which has been shown to be v. problematic now in two papers, despite this people still go on using it because it is historically popular and so seen as a safe option.

https://www.biorxiv.org/content/early/2018/07/25/377002
https://bioconductor.org/packages/devel/bioc/html/M3C.html

ADD REPLYlink modified 3 months ago • written 3 months ago by chris86170
2
gravatar for Jean-Karim Heriche
2.4 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche17k wrote:

Usually there is no true k in data, it often depends on what you're looking for. As an example, consider clustering proteins into complexes. Different people would have different view of the granularity of many complexes, e.g. should the proteasome be one complex or is it composed of the 20S core and the19S regulatory complex or maybe it should be subdivided into lid, base and alpha and beta rings ?
See also this previous post.

EDIT: After looking at your results, it seems you are in the case I described with the proteasome example. There seems to be structure in your data but which granularity you choose should depend on what the question you're trying to answer is. My choice from the heatmap images would be k=3 but if you're ready to accept k=2, you should also consider k=4. One of the consideration is whether you want to have small clusters.

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Jean-Karim Heriche17k

Thanks for your reply. But according to the "ConsensusClusterPlus" Reference Manual,

the best way to infer best k,we can (1) not relying solely on the consensus matrix heatmap to declare the existence of clusters;(2)apply the proportion of ambiguous clustering (PAC),just the CDF plot .And consensus clustering+PAC has been shown to perform better than several other commonly used methods.We can therefore infer the optimal number of clusters by the K value having the lowest PAC.

And another question,As you can see in here,the fraction of samples with consensus indices fall in [0,1],but in my CDF plot,the consensus matrix CDFs all start from 0.3 or higher.So is there any problem or something else in my data?

Thanks.

ADD REPLYlink written 2.4 years ago by LJ150
1

Without seeing your data and what you've done (e.g. what clustering algorithms you use with which distance measure...), it's hard to go any further.
As I tried to explain, relying on this kind of analysis alone ignores the context of the data. You need to consider whether the clusters are meaningful/relevant to the question you're interested in. For example, does the very small cluster you get for k>=4 make any sense in the context of your question ? Maybe this identifies outliers that should be removed from the data.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Jean-Karim Heriche17k

So,we do not need to pay special attention to the cluster numbers,as long as the results meet my requirement,and the clusters are meaningful? Thanks a million.

ADD REPLYlink written 2.4 years ago by LJ150
1
gravatar for Erik Wright
2.4 years ago by
Erik Wright350
Erik Wright350 wrote:

You could try using gap statistics: http://www.web.stanford.edu/~hastie/Papers/gap.pdf

There are several R functions available for this purpose.

ADD COMMENTlink written 2.4 years ago by Erik Wright350

The Gap-statistic does not perform well when p >> n.

ADD REPLYlink written 3 months ago by chris86170

you mean genes greater than samples?

ADD REPLYlink written 7 weeks ago by krushnach80420

It doesn't work well on high dimensional genomic data I have found.

It tends to just increase with K so choosing K is very hard in practice.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by chris86170

chris, out of curiosity, have you ever actually implemented the gap statistic in any analysis? I have used it repeatedly over the years but am usually disappointed in how it performs. I even produced a parallelised version of it, though: R functions edited for parallel processing

ADD REPLYlink written 25 days ago by Kevin Blighe32k
1

Yeah I tried running it on a bunch of cancer RNA-seq datasets, and basically it just increases as K increases. Which is an over fitting effect and it is hard to find the real K.

Finding K on the data we work on is actually quite a challenging problem, many methods fail in practice. GAP-statistic works well under certain conditions, but with high dimensional Gaussian data-sets it does not.

So far I have found M3C works nicely with a single platform and is my preferred solution atm, but it can't handle multi omic data. SNF is probably currently the best at multi omic data.

ADD REPLYlink modified 5 days ago • written 5 days ago by chris86170
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 788 users visited in the last hour