
This is more of a general question / situation.

I was recently brought onto a project that generates network fusions between mRNA and miRNA data from cancer patients. The deliverable is 3 distinct clusters/groups of patients that show significant differences in molecular profiles and survival.

For the sake of reproducibility, I redid the analysis that the previous bioinformatician did and generated the same raw results. However, there was a major discrepancy in how we interpreted the optimal number of clusters. In total, we each tested ~150 different parameter combinations. To find the optimal solution and number of clusters, I took the parameter set with the highest median silhouette width. The previous bioinformatician, on the other hand, ran some sort of PCA on each dataset individually prior to the network fusion, then took the result with the highest median silhouette width for a predetermined number of clusters.
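For concreteness, my selection criterion looks roughly like this. This is not the actual SNF pipeline, just a minimal sketch on synthetic data, with `KMeans` standing in for whatever clustering is run on the fused network; the data, parameter grid, and function names are all hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for the fused patient feature matrix:
# two well-separated groups of 50 "patients" with 5 features each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, size=(50, 5)),
    rng.normal(5, 0.3, size=(50, 5)),
])

def median_silhouette(X, k):
    """Cluster at k and return the MEDIAN per-sample silhouette width."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return float(np.median(silhouette_samples(X, labels)))

# My approach: sweep the candidate k values (one axis of the ~150-setting
# grid) and keep whichever maximizes the median silhouette width.
scores = {k: median_silhouette(X, k) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
```

On data that genuinely contains two groups, this picks `best_k = 2`; forcing k=3 splits a real group and the median silhouette drops accordingly.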

Overall, I'm interpreting 2 clusters as optimal with a silhouette width of ~0.8, and he's interpreting 3 clusters as optimal with a silhouette width of ~0.07. So a pretty stark difference. Interestingly enough, I think both solutions capture one of the groups quite well, in particular the better-survival group: no matter how the patients are clustered, that group sits far apart from the other one or two groups. The remaining two groups overlap to a large degree while having similar functional pathway analyses and survival. So my thinking is that it may be better to just take the 2 clusters as the optimal solution, if not only for the sheer fact that it's more statistically sound to do so, then for the reason that biologically the 2 suspect groups are very similar to one another. That would also make validation easier. I mean, it would be impossible to validate 3 groups if 2 of the groups are artificial, right?
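One way I've been sanity-checking the "two of the three groups are artificial" argument is to look at per-cluster silhouette widths rather than the overall figure. Again a hypothetical sketch on synthetic data (not our patient data), with `KMeans` as a stand-in clusterer: when k=3 is forced onto data that really has 2 groups, the separated group keeps a high silhouette while the two overlapping "clusters" hover near zero.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Synthetic stand-in: two groups drawn from nearly the same distribution
# (they overlap heavily), plus one clearly separated group.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(60, 4)),
    rng.normal(0.3, 0.5, size=(60, 4)),
    rng.normal(6.0, 0.5, size=(60, 4)),
])

# Force k=3 and inspect the median silhouette width within each cluster.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
sil = silhouette_samples(X, labels)
per_cluster = {int(c): float(np.median(sil[labels == c]))
               for c in np.unique(labels)}
```

The separated group's median silhouette stays high, while the two clusters carved out of the overlapping mass come out near zero, which is exactly the pattern we see between our 2- and 3-cluster solutions.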

Would anyone mind giving me advice on this? I'm running up against a wall with my people here. I don't want to rock the boat, but I honestly believe we made a mistake. I try to bring it up and present the evidence, but one person in particular is just dead set on dismissing it outright as some sort of statistical facade (interestingly, it's not the bioinformatician).

Ridiculously frustrated over here.