Tutorial: Rembrandt Glioma Data Analysis (PART II) - Are gender-specific genes related to cancer?
Posted 3.3 years ago by majuang66

This is Part II of the Rembrandt Glioma Data Analysis, "Are gender-specific genes related to cancer?". In Part I, I showed that the Rembrandt glioma data are clusterable. The next question is: if the data are clusterable, how do we determine the optimal number of clusters?

Open R and load the required libraries:

library(factoextra)

library(cluster)

library(NbClust)

I use the data set mydata_filtered_scale_1 from Rembrandt Glioma Data Analysis (PART I, https://www.biostars.org/p/201198/). The three most popular methods for determining the optimal number of clusters are the elbow, silhouette, and gap statistic methods (1).

  • The elbow and silhouette methods can both be computed with the function fviz_nbclust() of the factoextra package (1).
  • The gap statistic is computed with clusGap() of the cluster package and can be visualized with the function fviz_gap_stat() of the factoextra package.

I. Results of Elbow method (1) - K-means

fviz_nbclust(mydata_filtered_scale_1, kmeans, method = "wss") + geom_vline(xintercept = 2, linetype = 2)
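To make explicit what the elbow plot displays, here is a minimal sketch in base R. It assumes a small simulated matrix named toy (hypothetical, not the Rembrandt data) and computes the total within-cluster sum of squares for each k with kmeans(), which is the quantity plotted when method = "wss" is used.

```r
# Illustrative data only: two well-separated Gaussian blobs
# (an assumption for this sketch, not the Rembrandt data).
set.seed(123)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8.
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which the curve stops dropping sharply.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Reading the elbow is a visual judgment, which is why the dashed line in the command above is placed by hand with geom_vline().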

II. Results of Elbow method (1) - PAM

fviz_nbclust(mydata_filtered_scale_1, pam, method = "wss") + geom_vline(xintercept = 4, linetype = 2)

III. Results of Elbow method (1) - hierarchical clustering

fviz_nbclust(mydata_filtered_scale_1, hcut, method = "wss") + geom_vline(xintercept = 4, linetype = 2)

IV. Results of silhouette method (1) - K-means

fviz_nbclust(mydata_filtered_scale_1, kmeans, method = "silhouette")
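For readers who want to see the number behind the silhouette plot, the following is a rough sketch using silhouette() from the cluster package. The matrix toy is simulated for illustration (an assumption, not the Rembrandt data); the sketch computes the average silhouette width for each k and picks the k with the largest value.

```r
library(cluster)  # provides silhouette()

# Illustrative data only: two well-separated Gaussian blobs.
set.seed(123)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)
d <- dist(toy)  # Euclidean distances between samples

# Average silhouette width for k = 2..8 (not defined for k = 1).
k_values <- 2:8
avg_sil <- sapply(k_values, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

# The suggested number of clusters maximizes the average width.
best_k <- k_values[which.max(avg_sil)]
print(best_k)
```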

V. Results of silhouette method (1) - PAM

fviz_nbclust(mydata_filtered_scale_1, pam, method = "silhouette")

VI. Results of silhouette method (1) - hierarchical clustering

fviz_nbclust(mydata_filtered_scale_1, hcut, method = "silhouette", hc_method = "complete")
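The hierarchical case can be approximated by hand as well. This sketch assumes that cutting a complete-linkage tree with hclust() and cutree() is close to what hcut does with hc_method = "complete"; that equivalence is my assumption, and toy is simulated data used only for illustration.

```r
library(cluster)  # provides silhouette()

# Illustrative data only: two well-separated Gaussian blobs.
set.seed(123)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)
d <- dist(toy)

# Build the tree once with complete linkage, then cut it at each k.
hc <- hclust(d, method = "complete")

k_values <- 2:8
avg_sil_hc <- sapply(k_values, function(k) {
  cl <- cutree(hc, k = k)
  mean(silhouette(cl, d)[, "sil_width"])
})

best_k_hc <- k_values[which.max(avg_sil_hc)]
print(best_k_hc)
```

Because the tree is built only once, trying many values of k is cheap compared with rerunning k-means or PAM for each k.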

VII. Results of Gap statistic (1) - K-means

Optimal number of clusters: k = 10

VIII. Results of Gap statistic (1) - PAM

Optimal number of clusters: k = 11

IX. Results of Gap statistic (1) - hierarchical clustering

Optimal number of clusters: k = 11
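The commands behind the gap statistic results are not shown in this post, so the following is only a sketch of how such a result could be obtained with clusGap() from the cluster package, not the exact call used here. The data toy, K.max = 6, and B = 50 are arbitrary choices for illustration.

```r
library(cluster)  # provides clusGap() and maxSE()

# Illustrative data only: two well-separated Gaussian blobs.
set.seed(123)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)

# Gap statistic for k = 1..6 with k-means, using B = 50 reference data sets.
# Extra arguments (here nstart) are passed on to kmeans().
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 25)

# gap$Tab holds one row per k with columns logW, E.logW, gap, SE.sim.
# maxSE() applies a selection rule to the gap values and their standard errors.
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
print(gap, method = "firstSEmax")
print(k_hat)
```

On the real data, the same call with mydata_filtered_scale_1 in place of toy, followed by fviz_gap_stat(gap), should produce the kind of plot summarized above. Note that the selected k depends on the rule passed as method and on K.max.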

The NbClust package provides 30 indices for determining the relevant number of clusters.

nb <- NbClust(mydata_filtered_scale_1, distance = "euclidean", min.nc = 2, max.nc = 10, method = "complete", index = "all")

* : The Hubert index is a graphical method of determining the number of clusters. In the plot of Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure i.e the significant peak in Hubert index second differences plot.

* : The D index is a graphical method of determining the number of clusters. In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase of the value of the measure.


Among all indices:

  • 10 proposed 2 as the best number of clusters
  • 1 proposed 3 as the best number of clusters
  • 1 proposed 4 as the best number of clusters
  • 1 proposed 5 as the best number of clusters
  • 5 proposed 6 as the best number of clusters
  • 1 proposed 7 as the best number of clusters
  • 2 proposed 10 as the best number of clusters

               ***** Conclusion *****
    

According to the majority rule, the best number of clusters is 2
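The majority rule itself is just a vote count. As a small sketch, the votes below are transcribed from the summary printed above, and the most frequent value is taken as the winner; the variable names are my own.

```r
# Votes transcribed from the summary above:
# 10 indices chose k = 2, 1 chose 3, 1 chose 4, 1 chose 5,
# 5 chose 6, 1 chose 7, and 2 chose 10.
votes <- rep(c(2, 3, 4, 5, 6, 7, 10), times = c(10, 1, 1, 1, 5, 1, 2))

# Count the votes per candidate k and take the most frequent one.
tally <- table(votes)
best_k <- as.integer(names(tally)[which.max(tally)])

print(tally)
print(best_k)
```

With a real NbClust result, the per-index choices should be available in the first row of nb$Best.nc, so the same tally can be computed from that row instead of typing the votes by hand.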


fviz_nbclust(nb) + theme_minimal()

Among all indices:

  • 2 proposed 0 as the best number of clusters
  • 10 proposed 2 as the best number of clusters
  • 1 proposed 3 as the best number of clusters
  • 1 proposed 4 as the best number of clusters
  • 1 proposed 5 as the best number of clusters
  • 5 proposed 6 as the best number of clusters
  • 1 proposed 7 as the best number of clusters
  • 2 proposed 10 as the best number of clusters
  • 3 proposed NA's as the best number of clusters

Conclusion

According to the majority rule, the best number of clusters is 2

Taken together, these results suggest that the 549 Rembrandt samples, each described by 43 genes, can be divided into 2 clusters.

Reference

(1) Determining the optimal number of clusters: 3 must known methods - Unsupervised Machine Learning (http://www.sthda.com)
