Tutorial:Rembrandt Glioma Data Analysis (PART II) - Are Gender specific genes related to cancer ?
7.8 years ago
majuang66 ▴ 140

This post is Part II of the Rembrandt Glioma Data Analysis series, "Are Gender specific genes related to cancer ?". Part I showed that the Rembrandt glioma data is clusterable. Given that the data is clusterable, the next question is how to determine the optimal number of clusters.

# Open R and load the required libraries
library(factoextra)
library(cluster)
library(NbClust)

I use the data object mydata_filtered_scale_1 created in Part I (Rembrandt Glioma Data Analysis (PART I) - Are Gender specific genes related to cancer ?).

The three most popular methods for determining the optimal number of clusters are the elbow method, the silhouette method, and the gap statistic (1).

The elbow and silhouette methods can both be computed with the function fviz_nbclust() from the factoextra package; the silhouette widths themselves are calculated by the cluster package (1).

The gap statistic is computed by clusGap() in the cluster package and can be visualized with the function fviz_gap_stat() from the factoextra package.

#I. Results of Elbow method (1) - K-means
fviz_nbclust(mydata_filtered_scale_1, kmeans, method = "wss") + geom_vline(xintercept = 2, linetype = 2)
![enter image description here][1]
#II. Results of Elbow method (1) - PAM
fviz_nbclust(mydata_filtered_scale_1, pam, method = "wss") + geom_vline(xintercept = 4, linetype = 2)
![enter image description here][2]
#III. Results of Elbow method (1) - hierarchical cluster
fviz_nbclust(mydata_filtered_scale_1, hcut, method = "wss") + geom_vline(xintercept = 4, linetype = 2)
![enter image description here][3]
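
To make the elbow plots above easier to read, here is a minimal base-R sketch of what the "wss" criterion measures. It runs on a synthetic toy matrix (an assumption for illustration, not the Rembrandt data), and the names toy, k_values and wss are hypothetical. It is only a rough equivalent of what fviz_nbclust() draws, not its actual implementation.

```r
# Elbow sketch on synthetic data (NOT mydata_filtered_scale_1).
set.seed(123)

# Hypothetical toy matrix: two well-separated groups of 50 points in 2-D.
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares (WSS) for k = 1..8.
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Plot WSS against k; the "elbow" is the point after which
# adding more clusters no longer reduces WSS by much.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On data with two clear groups, the curve should drop sharply from k = 1 to k = 2 and then flatten. On real expression data the bend is often less obvious, which is why the dashed lines in the plots above are a judgment call.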
#IV. Results of silhouette method (1) - K-means
fviz_nbclust(mydata_filtered_scale_1, kmeans, method = "silhouette")
![enter image description here][4]
#V. Results of silhouette method (1) - PAM
fviz_nbclust(mydata_filtered_scale_1, pam, method = "silhouette")
![enter image description here][5]
#VI. Results of silhouette method (1) - hierarchical cluster
fviz_nbclust(mydata_filtered_scale_1, hcut, method = "silhouette", hc_method = "complete")
![enter image description here][6]
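
For readers who want to see the quantity behind the silhouette plots, the following sketch computes the average silhouette width by hand for a range of k. It again uses a synthetic toy matrix rather than the Rembrandt data, and toy, avg_sil and best_k are hypothetical names introduced here. It approximates what the "silhouette" option reports and is not the code used to produce the figures above.

```r
library(cluster)  # provides silhouette()

# Synthetic data (NOT mydata_filtered_scale_1): two groups of 50 points.
set.seed(123)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)
d <- dist(toy)  # Euclidean distances between all pairs of points

# Average silhouette width for k = 2..8 (undefined for k = 1).
k_values <- 2:8
avg_sil <- sapply(k_values, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

# The suggested k is the one with the largest average width.
best_k <- k_values[which.max(avg_sil)]
best_k
```

A higher average silhouette width means points sit closer to their own cluster than to the nearest neighbouring cluster, so the maximum of this curve is taken as the suggested number of clusters.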
#VII. Results of Gap statistic (1) - K-means
Optimal number of clusters: k = 10
#VIII. Results of Gap statistic (1) - PAM
Optimal number of clusters: k = 11
#IX. Results of Gap statistic (1) - hierarchical cluster
Optimal number of clusters: k = 11
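
The post does not show the commands used for the gap statistic, so the following is only a sketch of one common way to compute it, using clusGap() and maxSE() from the cluster package on synthetic data. The toy matrix, the choices K.max = 6 and B = 50, and the "firstSEmax" selection rule are all assumptions made for illustration. They are not necessarily the settings behind the results listed above.

```r
library(cluster)  # provides clusGap() and maxSE()

# Synthetic data (NOT mydata_filtered_scale_1): two groups of 50 points.
set.seed(123)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

# Gap statistic for k = 1..6 with B = 50 bootstrap reference data sets.
# nstart is passed through to kmeans().
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 50,
               nstart = 25, verbose = FALSE)

# Choose k with the "firstSEmax" rule (an assumed choice; other rules exist).
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
                method = "firstSEmax")

print(gap, method = "firstSEmax")
best_k
```

Different selection rules can give different answers on the same gap curve, so it is worth checking which rule was applied before comparing a gap-based k with the elbow or silhouette results.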

The NbClust package provides 30 indices for determining the relevant number of clusters.

nb <- NbClust(mydata_filtered_scale_1, distance = "euclidean", min.nc = 2, max.nc = 10, method = "complete", index = "all")


*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 

******************************************************************* 
* Among all indices:                                                
* 10 proposed 2 as the best number of clusters 
* 1 proposed 3 as the best number of clusters 
* 1 proposed 4 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 5 proposed 6 as the best number of clusters 
* 1 proposed 7 as the best number of clusters 
* 2 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            

**According to the majority rule, the best number of clusters is  2**


******************************************************************* 
> fviz_nbclust(nb)+theme_minimal()

[Figure: bar plot of the number of clusters proposed by the NbClust indices]

Among all indices: 
===================
* 2 proposed  0 as the best number of clusters
* 10 proposed  2 as the best number of clusters
* 1 proposed  3 as the best number of clusters
* 1 proposed  4 as the best number of clusters
* 1 proposed  5 as the best number of clusters
* 5 proposed  6 as the best number of clusters
* 1 proposed  7 as the best number of clusters
* 2 proposed  10 as the best number of clusters
* 3 proposed  NA's as the best number of clusters

Conclusion
=========================
**According to the majority rule, the best number of clusters is  2**
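
The majority rule in the summary above can be reproduced by hand from the printed vote counts. The sketch below copies those counts into a named vector (votes, valid and best_k are hypothetical names) and drops the "0" entry, on the assumption that it does not correspond to a usable number of clusters.

```r
# Vote counts copied from the fviz_nbclust(nb) summary above:
# names are the proposed k, values are the number of indices proposing it.
votes <- c("0" = 2, "2" = 10, "3" = 1, "4" = 1,
           "5" = 1, "6" = 5, "7" = 1, "10" = 2)

# Assumption: k = 0 is not a usable answer, so it is excluded from the vote.
valid <- votes[names(votes) != "0"]

# Majority rule: the k proposed by the largest number of indices.
best_k <- as.integer(names(valid)[which.max(valid)])
best_k
```

Note that the runner-up, k = 6, is backed by 5 indices, so the vote for k = 2 is a plurality rather than unanimous agreement.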

# By the NbClust majority rule, the 549 Rembrandt samples, each described by 43 genes, may be divided into 2 clusters.

Reference

(1) Determining the optimal number of clusters: 3 must known methods - Unsupervised Machine Learning (http://www.sthda.com)
