Hello again Biostars community!
I am working on a project for my internship on the efficacy of different clustering techniques on RNA-seq data. I have used many of the common clustering techniques in R and compared the clusters they produce. I have now reached a point in the project where I want to analyze and explicitly show how my clustering performs compared to "random" clustering (where I redistribute the cluster labels randomly across the genes).
So far, I have identified k-means clustering as the most helpful for my dataset, and I have identified some positive control groups that I know are biologically meaningful and should cluster together. For these positive controls I have calculated a p-value, and this is where my "random" groups come in!
For each of my clusters, I want to redistribute my genes as randomly as possible (within the limits of reproducibility) and examine how my p-values change for the positive control groups. Ideally I would like to see the p-values for my clusters showing much higher significance than the random clusters, but who knows? (I mean... I'm pretty sure I know, but that's why I'm doing this!)
I would like each new random cluster to contain the same number of genes as the corresponding cluster that I have already generated (e.g. Cluster_1 had 60 genes, so Random_Cluster_1 should also have 60 genes, and so on). Of course, I also need to make sure that each gene is assigned to exactly one cluster.
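To make the constraints above concrete: shuffling the existing cluster labels across the genes keeps every cluster at its original size and leaves each gene with exactly one label. Below is a minimal sketch of that idea in Python (in R, `sample()` on the label vector does the same permutation). The gene names, the toy clusters, and the "how many controls land together" statistic are all made up for illustration; the statistic is only a stand-in for whatever p-value is actually computed on the positive controls.

```python
import random
from collections import Counter

def shuffle_labels(assignment, rng):
    """Permute cluster labels across genes.

    Cluster sizes are preserved and every gene still gets exactly one label.
    """
    genes = sorted(assignment)               # fixed order, for reproducibility
    labels = [assignment[g] for g in genes]
    rng.shuffle(labels)                      # in-place permutation of the labels
    return dict(zip(genes, labels))

def max_controls_together(assignment, controls):
    """Largest number of control genes sharing one cluster (illustrative statistic)."""
    return max(Counter(assignment[g] for g in controls).values())

def permutation_p_value(assignment, controls, n_perm=1000, seed=1):
    """Share of label shuffles that group the controls at least as well as observed."""
    rng = random.Random(seed)                # seeded, so the result is reproducible
    observed = max_controls_together(assignment, controls)
    hits = sum(
        max_controls_together(shuffle_labels(assignment, rng), controls) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)         # +1 so the estimate is never exactly 0

# Hypothetical toy data: 3 clusters of 4 genes each; controls g1-g3 all sit in cluster 1.
clusters = {f"g{i}": (i - 1) // 4 + 1 for i in range(1, 13)}
controls = ["g1", "g2", "g3"]

randomized = shuffle_labels(clusters, random.Random(42))
print(Counter(randomized.values()))          # same cluster sizes as the input
print(permutation_p_value(clusters, controls))
```

Repeating the shuffle many times gives a null distribution to compare the real clustering against, rather than a single random draw.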
Would anyone be able to recommend a robust method of "randomly" reassigning my genes to new clusters so that I can check my clustering performance?
Any feedback would be appreciated, as I don't have much experience in this area (I'm pretty new to bioinformatics!).
Thank you!
You could do a bootstrap test: repeatedly sample a subset of genes (control and treatment), cluster each sample, and see how the clustering error changes for a given k. See http://www.win-vector.com/blog/2016/02/finding-the-k-in-k-means-by-parametric-bootstrap/
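A rough sketch of the resampling idea, under some assumptions: this is a plain nonparametric bootstrap (genes drawn with replacement), not the parametric bootstrap from the linked post, and it uses a bare-bones Lloyd's k-means written in plain Python so that it is self-contained. The toy data and all function names are made up; in R you would simply call `kmeans()` on each resample and read off the total within-cluster sum of squares (WSS).

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_wss(points, k, rng, n_iter=50):
    """Bare-bones Lloyd's k-means; returns the total within-cluster sum of squares."""
    centers = rng.sample(points, k)          # random initial centers
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # New center = mean of its group; an empty group keeps its old center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:           # converged
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def bootstrap_wss(points, k, n_boot=20, seed=0):
    """WSS of k-means on n_boot resamples of the genes (drawn with replacement)."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_boot):
        resample = [rng.choice(points) for _ in points]
        results.append(kmeans_wss(resample, k, rng))
    return results

# Hypothetical toy "expression" data: two well-separated groups of 30 genes, 2 samples each.
gen = random.Random(1)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(30)] + \
       [(gen.gauss(8, 1), gen.gauss(8, 1)) for _ in range(30)]

for k in (1, 2, 3):
    wss = bootstrap_wss(data, k)
    print(k, round(min(wss), 1), round(max(wss), 1))   # spread of WSS across resamples
```

The spread of WSS across resamples gives a feel for how stable the fit is at each k, which a single elbow plot on the full data does not show.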
I see, that's interesting. Earlier in my project I created an "elbow graph" from the WSS values to try to get some insight into the number of clusters I should use. It was not as helpful as I had hoped, but I only tried it on my own data, without any "synthetic" data to compare it to. I haven't done anything like that before, but it does look very helpful. Thank you!