
Tutorial: Rembrandt Glioma Data Analysis (PART II) - Are gender-specific genes related to cancer?

8.0 years ago

majuang66 ▴ 140

This is Part II of the Rembrandt Glioma Data Analysis series, "Are gender-specific genes related to cancer?". In Part I, I showed that the Rembrandt glioma data is clusterable. Given that the data is clusterable, the next question is: how do we determine the optimal number of clusters?

# Start R and load the required libraries
library(factoextra)  # fviz_nbclust(), fviz_gap_stat(), hcut()
library(cluster)     # pam(), clusGap()
library(NbClust)     # NbClust()

I use the data set mydata_filtered_scale_1 from Part I (Rembrandt Glioma Data Analysis (PART I) - Are Gender specific genes related to cancer ?).

The three most popular methods for determining the optimal number of clusters are the elbow method, the silhouette method, and the gap statistic (1).

The elbow and silhouette methods rely on the factoextra and cluster packages, respectively, and both can be computed with the function fviz_nbclust() (1).

The gap statistic is computed by clusGap() in the cluster package and can be visualized with the function fviz_gap_stat() from the factoextra package.
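To make the elbow and silhouette criteria concrete, here is a minimal sketch that computes both by hand for k-means. It is not the code behind the plots in this post: `toy` is a hypothetical simulated stand-in for mydata_filtered_scale_1, and the range of k and `nstart` value are assumptions chosen for illustration.

```r
library(cluster)  # silhouette(); the cluster package ships with standard R installs

set.seed(42)
# Hypothetical stand-in for mydata_filtered_scale_1: two separated groups, scaled
toy <- scale(rbind(
  matrix(rnorm(60 * 5, mean = 0), ncol = 5),
  matrix(rnorm(60 * 5, mean = 4), ncol = 5)
))

ks <- 1:8

# Elbow method: total within-cluster sum of squares (WSS) for each k.
# The "elbow" is the k after which WSS stops dropping sharply.
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Silhouette method: average silhouette width for each k >= 2
# (the silhouette is not defined for a single cluster).
d <- dist(toy)
avg_sil <- sapply(ks[-1], function(k) {
  cl <- kmeans(toy, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
best_k_sil <- ks[-1][which.max(avg_sil)]

print(round(wss, 1))
print(round(avg_sil, 3))
print(best_k_sil)
```

The elbow still has to be judged by eye from `wss`, which is why the plots below mark it manually with geom_vline(), whereas the silhouette criterion gives a single maximizing k.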

#I. Results of Elbow method (1) - K-means
fviz_nbclust(mydata_filtered_scale_1,kmeans,method="wss")+geom_vline(xintercept=2,linetype=2)
![enter image description here][1]
#II. Results of Elbow method (1) - PAM
fviz_nbclust(mydata_filtered_scale_1,pam,method="wss")+geom_vline(xintercept=4,linetype=2)
![enter image description here][2]
#III. Results of Elbow method (1) - hierarchical cluster
fviz_nbclust(mydata_filtered_scale_1,hcut,method="wss")+geom_vline(xintercept=4,linetype=2)
![enter image description here][3]
#IV. Results of silhouette method (1) - K-means
fviz_nbclust(mydata_filtered_scale_1,kmeans,method="silhouette")
![enter image description here][4]
#V. Results of silhouette method (1) - PAM
fviz_nbclust(mydata_filtered_scale_1,pam,method="silhouette")
![enter image description here][5]
#VI. Results of silhouette method (1) - hierarchical cluster
fviz_nbclust(mydata_filtered_scale_1,hcut,method="silhouette",hc_method="complete")
![enter image description here][6]
#VII. Results of Gap statistic (1) - K-means
Optimal number of clusters: k = 10
#VIII. Results of Gap statistic (1) - PAM
Optimal number of clusters: k = 11
#IX. Results of Gap statistic (1) - hierarchical cluster
Optimal number of clusters: k = 11
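The gap statistic results are listed without the code that produced them. The following is a sketch of how such a result could be obtained with clusGap(), not the exact call used for this post. `toy` is a hypothetical simulated stand-in for mydata_filtered_scale_1, and `K.max`, `B`, and the "firstSEmax" selection rule are assumptions.

```r
library(cluster)  # clusGap(), maxSE(), pam()

set.seed(123)
# Hypothetical stand-in for mydata_filtered_scale_1
toy <- scale(rbind(
  matrix(rnorm(50 * 4, mean = 0), ncol = 4),
  matrix(rnorm(50 * 4, mean = 5), ncol = 4)
))

# K-means: B is the number of simulated reference data sets (kept small for speed)
gap_km <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 25)

# PAM: clusGap() expects a function returning a list with a `cluster` component,
# so pam() is wrapped (this wrapper follows the pattern in the clusGap help page)
pam1 <- function(x, k) list(cluster = pam(x, k, cluster.only = TRUE))
gap_pam <- clusGap(toy, FUNcluster = pam1, K.max = 6, B = 50)

# Choose k with the "firstSEmax" rule; maxSE() offers other rules as well
pick_k <- function(gap) {
  maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
}
k_km  <- pick_k(gap_km)
k_pam <- pick_k(gap_pam)

print(gap_km, method = "firstSEmax")
print(c(kmeans = k_km, pam = k_pam))

# With factoextra installed, fviz_gap_stat(gap_km) plots the same object.
```

The selected k depends on the rule passed to maxSE() and on the upper limit K.max, so a gap statistic optimum that sits at or near the largest k tried is worth rechecking with a wider range.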

The NbClust package provides 30 indices for determining the relevant number of clusters.

nb <- NbClust(mydata_filtered_scale_1, distance = "euclidean", min.nc = 2, max.nc = 10, method = "complete", index = "all")


*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 

******************************************************************* 
* Among all indices:                                                
* 10 proposed 2 as the best number of clusters 
* 1 proposed 3 as the best number of clusters 
* 1 proposed 4 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 5 proposed 6 as the best number of clusters 
* 1 proposed 7 as the best number of clusters 
* 2 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            

**According to the majority rule, the best number of clusters is  2**


******************************************************************* 
> fviz_nbclust(nb)+theme_minimal()

enter image description here

Among all indices: 
===================
* 2 proposed  0 as the best number of clusters
* 10 proposed  2 as the best number of clusters
* 1 proposed  3 as the best number of clusters
* 1 proposed  4 as the best number of clusters
* 1 proposed  5 as the best number of clusters
* 5 proposed  6 as the best number of clusters
* 1 proposed  7 as the best number of clusters
* 2 proposed  10 as the best number of clusters
* 3 proposed  NA's as the best number of clusters

Conclusion
=========================
**According to the majority rule, the best number of clusters is  2**

# By the majority rule, the 549 Rembrandt samples (43 genes each) might best be divided into 2 clusters.
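The majority rule can be checked by hand from the NbClust summary printed above. This sketch copies the vote counts from that output (leaving out the 0 and NA entries) and reports the winning k together with its share of the votes. The variable names are illustrative only.

```r
# Votes copied from the NbClust summary: names are the proposed number of
# clusters, values are how many indices proposed it (0 and NA entries omitted).
votes <- c("2" = 10, "3" = 1, "4" = 1, "5" = 1, "6" = 5, "7" = 1, "10" = 2)

# Majority rule: the k proposed by the largest number of indices
best_k <- as.integer(names(votes)[which.max(votes)])

# Share of the counted votes that went to the winner
share <- max(votes) / sum(votes)

print(best_k)
print(round(share, 2))

# Assumption: with the fitted object `nb` available, the same tally should be
# obtainable via table(nb$Best.nc["Number_clusters", ]).
```

Since 2 clusters collects 10 of the 21 counted votes and 6 clusters is the runner-up with 5, the majority is a plurality rather than a consensus, which is consistent with the cautious wording of the conclusion.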

Reference

(1) Determining the optimal number of clusters: 3 must known methods - Unsupervised Machine Learning (http://www.sthda.com)

gene R Rembrandt • 2.5k views

updated 15 months ago by Ram 44k • written 8.0 years ago by majuang66 ▴ 140