It has been a long time since I last posted, but I have a question and would appreciate some feedback. I will try to keep it as simple as I can. I am stuck on a conceptual problem: how to reduce the number of CpG sites that need to be validated. I performed a DNA methylation analysis and found around 150 CpG sites that are differentially methylated between two tissues. We now need to reduce them to a number that can realistically be validated. Our lab does not want to select sites by a fold-change criterion, but by a more robust, classification-based approach. I am not familiar with supervised machine learning and we lack that expertise in the lab, so I came up with an idea based on a partitioning method and unsupervised hierarchical clustering.
How it works:
- Given the data matrix of CpG sites with their beta values, obtain the optimal number of clusters k for k-means using the WSS (elbow), silhouette or gap statistic method.
- Use that k to assign the CpG sites to clusters, rank the sites within each cluster, and select the top-ranked sites from each cluster.
- This gives me a restricted list, and the ranking metric is applied in such a way that the selected sites still preserve the hierarchical clustering of our tissues.
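
A minimal sketch of the steps above in plain Python, not the actual pipeline. Several things here are assumptions made only for illustration: the toy `beta` matrix, the tissue labels in `groups`, the helper names (`kmeans`, `silhouette`, `select_sites`), the choice of the silhouette width as the only criterion for k, and above all the ranking metric (absolute difference in mean beta between the two tissues), which is a stand-in for whatever metric preserves the tissue clustering.

```python
import random
from math import dist
from statistics import mean

def kmeans(rows, k, seed=0, n_iter=50):
    """Plain Lloyd's k-means; returns one cluster label per row."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assign every row to its nearest centre (Euclidean distance).
        labels = [min(range(k), key=lambda c: dist(r, centers[c])) for r in rows]
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [mean(col) for col in zip(*members)]
    return labels

def silhouette(rows, labels):
    """Mean silhouette width; rows in singleton clusters score 0."""
    scores = []
    for i, r in enumerate(rows):
        by_cluster = {}
        for j, other in enumerate(rows):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(r, other))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = mean(own)                                  # cohesion
        b = min(mean(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)

def select_sites(beta, groups, k_range=range(2, 5), top_n=1):
    """Pick k by silhouette, then keep the top_n sites of each cluster."""
    best_k = max(k_range, key=lambda k: silhouette(beta, kmeans(beta, k)))
    labels = kmeans(beta, best_k)

    def delta(row):
        # ASSUMED ranking metric: |mean beta in tissue A - mean beta in tissue B|.
        a = mean(v for v, g in zip(row, groups) if g == "A")
        b = mean(v for v, g in zip(row, groups) if g == "B")
        return abs(a - b)

    chosen = []
    for c in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        idx.sort(key=lambda i: delta(beta[i]), reverse=True)
        chosen.extend(idx[:top_n])
    return best_k, sorted(chosen)

# Toy data: 6 CpG sites (rows) x 4 samples (columns), unscaled beta values.
groups = ["A", "A", "B", "B"]
beta = [
    [0.90, 0.88, 0.10, 0.12],
    [0.80, 0.82, 0.20, 0.22],
    [0.85, 0.86, 0.15, 0.18],
    [0.10, 0.15, 0.90, 0.85],
    [0.20, 0.25, 0.80, 0.78],
    [0.12, 0.10, 0.95, 0.92],
]
best_k, chosen = select_sites(beta, groups)
print("k =", best_k, "selected site indices:", chosen)
```

The sketch clusters the raw beta values, so it does not answer the scaling question; any scaling would be applied to `beta` before calling `select_sites`.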
I have done this before on RNA-Seq data. I know that for RNA-Seq one needs to scale the data, so I scaled by rows, since the final goal is to reduce the number of genes, which are the rows of my data matrix.

For methylation, on the other hand, the starting beta values already lie between 0 and 1. So when I look for the optimal number of clusters on the CpG matrix (sites in rows, samples in columns, values ranging from 0 to 1), should I still scale by rows before estimating the optimal k? The results differ depending on whether I do not scale, scale by rows, or scale by columns.

My inclination is not to scale the CpG sites at all when estimating the optimal k for k-means with the WSS, gap statistic or silhouette method. These are partitioning methods, and partitioning methods usually need scaled input. I have mostly seen this done for gene expression, though, and not for methylation data, or maybe I have missed it.

I would like to hear what others think about this approach, and whether I need to scale the data at all before feeding the matrix into the search for the optimal number of clusters. If scaling should be applied, should it be row scaling or column scaling? I ask because the methylation heat maps I see are mostly plotted without scaling.

Any feedback is appreciated, and I can provide more information if anyone wants it. Thanks
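
To make the scaling question concrete, here is a small sketch of what row and column z-scoring do to beta values. The two-site `beta` matrix and the helper names (`zscore`, `scale_rows`, `scale_columns`) are made up for illustration, and the population standard deviation is an arbitrary choice. It only shows the mechanics, not which option is right for methylation data.

```python
from math import dist
from statistics import mean, pstdev

def zscore(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s if s > 0 else 0.0 for v in values]

def scale_rows(matrix):
    """Standardise each CpG site across samples."""
    return [zscore(row) for row in matrix]

def scale_columns(matrix):
    """Standardise each sample across CpG sites."""
    cols = [zscore(col) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]

# Hypothetical beta values: 2 CpG sites (rows) x 4 samples (columns).
beta = [
    [0.90, 0.90, 0.10, 0.10],  # large methylation difference between tissues
    [0.55, 0.55, 0.45, 0.45],  # same direction, much smaller difference
]

rows_scaled = scale_rows(beta)
cols_scaled = scale_columns(beta)

raw_gap = dist(beta[0], beta[1])                   # distance k-means sees without scaling
scaled_gap = dist(rows_scaled[0], rows_scaled[1])  # distance after row scaling

print("raw distance:", raw_gap)
print("row-scaled distance:", scaled_gap)
print("column-scaled matrix:", cols_scaled)
```

After row scaling the two sites have essentially the same profile, so the size of the methylation difference no longer influences the distances that k-means, WSS and the silhouette width are built on; only the pattern across samples does. Without scaling, both the pattern and the magnitude contribute.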