Question: hierarchical clustering of genes - identify number of clusters
Asked 2.7 years ago by hodayabeer:

Hi all,

I have a data set where the rows are genes and columns are phylogenetic profiling scores.

I clustered the genes with hierarchical clustering in R and got a dendrogram from the hclust() output. I need to identify the number of clusters so that genes in the same cluster are very similar to each other across the columns (genes that have similar values in the same columns belong to the same cluster), and basically split the data into modules. I need a systematic way to do this on many datasets at once, without human involvement or manual tuning.

I used the function NbClust(), but its output was not appropriate: some genes appear in the same cluster although they are not similar enough :(

I would really appreciate a suggestion for an R function that takes out genes that do not belong to their cluster, or a better function for determining the best number of clusters that allows some genes to be left out.

Thank you!

Answered 2.7 years ago by Kevin Blighe:

For extracting information from the clustering, take a look at my answer here: A: extract dendrogram cluster from pheatmap. This is a very crude way of deciding the ideal cluster number, though, because you, the human, are deciding manually where to cut the tree. However, if you cluster using correlation distance as the dissimilarity, then you can easily say that you identified cluster groups based on, for example, Pearson correlation > 0.9.
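A minimal sketch of the correlation-based cut, under some assumptions not stated in the thread: the data are simulated, the 0.9 cutoff is only the example value mentioned above, and complete linkage is chosen because its merge height equals the largest pairwise distance inside a cluster, so cutting at h = 1 - 0.9 keeps together only genes whose pairwise correlations are all at least 0.9. Treating single-gene clusters as "unassigned" is a hypothetical post-processing step, not something the answer prescribes.

```r
# Sketch only: simulated profiles; the 0.9 cutoff is an illustrative assumption
set.seed(1)
base1 <- rnorm(10)
base2 <- rnorm(10)

# 7 genes (rows) x 10 profile columns: two groups of 3 near-identical
# profiles plus one gene with an unrelated alternating pattern
x <- rbind(
  t(replicate(3, base1 + rnorm(10, sd = 0.05))),
  t(replicate(3, base2 + rnorm(10, sd = 0.05))),
  rep(c(1, -1), 5)
)
rownames(x) <- paste0("gene", 1:7)

# cor() correlates columns, so transpose to correlate genes (rows)
gene_dist <- as.dist(1 - cor(t(x), method = "pearson"))

# Complete linkage: merge height = largest pairwise distance in the cluster
hc <- hclust(gene_dist, method = "complete")

# Cut at correlation distance 0.1, i.e. Pearson correlation >= 0.9
clusters <- cutree(hc, h = 1 - 0.9)

# Hypothetical post-processing: clusters of size 1 are "unassigned" genes
sizes <- table(clusters)
unassigned <- names(clusters)[clusters %in% names(sizes)[sizes == 1]]
```

Because the cut height is fixed in advance, no manual inspection of the dendrogram is needed; the number of clusters falls out of the threshold rather than being chosen by eye.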

Other ways of deciding the ideal cluster number in a dataset include, but are not limited to:

  • Silhouette method
  • Elbow method
  • Gap statistic
  • Consensus Clustering

All of these have implementations in R.
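As one illustration of the list above, here is a rough sketch of the silhouette method using silhouette() from the cluster package. The helper name best_k_silhouette, the Euclidean distance, the average linkage, and the simulated three-group data are all assumptions made for the example; they are not taken from the thread.

```r
library(cluster)  # provides silhouette()

# Hypothetical helper: choose k by maximising the mean silhouette width
best_k_silhouette <- function(x, k_range = 2:10, method = "average") {
  d <- dist(x)                        # Euclidean distance between rows
  hc <- hclust(d, method = method)
  k_range <- k_range[k_range < nrow(x)]  # silhouette needs 2 <= k < n
  avg_width <- sapply(k_range, function(k) {
    sil <- silhouette(cutree(hc, k = k), d)
    mean(sil[, "sil_width"])
  })
  k_range[which.max(avg_width)]
}

# Simulated data: three well-separated groups of 20 rows each
set.seed(42)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 5),
  matrix(rnorm(100, mean = 10), ncol = 5),
  matrix(rnorm(100, mean = 20), ncol = 5)
)

best_k <- best_k_silhouette(x, k_range = 2:6)
```

The same helper could be run with a correlation-based distance instead of dist(); the gap statistic has a comparable entry point in the same package, clusGap().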

I published on this recently in the context of asthma and vitamin D: Vitamin D prenatal programming of childhood metabolomics profiles at age 3 y.


Thank you for your answer. What I am trying to do is write code that will identify the number of clusters in many datasets simultaneously. Therefore, I am trying to find a systematic way to do that, without human involvement. So how can I apply your suggestion in R code and let the algorithm determine the 'h' or the 'k' in cutree() based on the Pearson correlation?

(comment written 2.7 years ago by hodayabeer)

Whilst automation is good, you should never completely disengage from the computer. Automated processes do fail us, across various industries, and sometimes with fatal consequences.

To do what you want, just set up a loop over the datasets and write the results to a simple text file or to the terminal so that you can then screen them. If you want ideas for loops, look at my code here: R functions edited for parallel processing

Note that you can save object names in a vector and then call them one-by-one in a loop:

# Three example matrices (5 rows x 10 columns) of random values
mat1 <- matrix(rexp(50, rate=0.1), ncol=10)
mat2 <- matrix(rexp(50, rate=0.1), ncol=10)
mat3 <- matrix(rexp(50, rate=0.1), ncol=10)

# Store the object names, then fetch each object by name inside the loop
MyDataMatrices <- c("mat1", "mat2", "mat3")

for (i in seq_along(MyDataMatrices)) {
  currentDataMatrix <- get(MyDataMatrices[i])
  # [do processing on currentDataMatrix]
}
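
The loop idea can also be sketched with a named list instead of get(), writing one summary row per dataset to a text file for later screening. Everything specific here is an assumption for illustration: the helper count_clusters, the 0.9 correlation cutoff, complete linkage, and the output file name are hypothetical.

```r
# Sketch only: simulated matrices; cutoff and file name are assumptions
set.seed(1)
datasets <- list(
  mat1 = matrix(rexp(50, rate = 0.1), ncol = 10),
  mat2 = matrix(rexp(50, rate = 0.1), ncol = 10),
  mat3 = matrix(rexp(50, rate = 0.1), ncol = 10)
)

# Hypothetical helper: number of clusters when rows are grouped so that
# all pairwise Pearson correlations within a cluster are >= min_cor
count_clusters <- function(m, min_cor = 0.9) {
  hc <- hclust(as.dist(1 - cor(t(m))), method = "complete")
  length(unique(cutree(hc, h = 1 - min_cor)))
}

# One row per dataset
results <- data.frame(
  dataset = names(datasets),
  n_genes = sapply(datasets, nrow),
  n_clusters = sapply(datasets, count_clusters),
  row.names = NULL
)

# Tab-separated summary that can be screened by eye afterwards
out_file <- file.path(tempdir(), "cluster_summary.txt")
write.table(results, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Keeping the datasets in a list avoids looking objects up by name, and the summary file still leaves room for the manual check recommended above.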

(reply written 2.7 years ago by Kevin Blighe)