I am analyzing the proteome data of cells from patients with different monogenic backgrounds of a specific disease. For cluster analysis, I applied the k-means algorithm from the stats package and used sum-of-square and silhouette width to determine the optimal amount of clusters. Next, I plotted the data as a tSNE and encircled the k-means clusters with colored shading. I am using tSNE because PCA analysis showed that at least 8 dimensions are needed to include the variance of the dataset so a PC1/2 plot would be insufficient (and actually it looks bad/convoluted)
The resultant tSNE plots look very good and most clusters contain samples from the same genotype, strengthening our hypothesis that the proteome profiles are indeed genotype specific. But there are two additional clusters that contain a mix of genotypes. I plotted different metadata from the sample annotation and found that the two "extra cluster" contain samples from patients that received unusual treatment schedules.
I wonder if I could compute some kind of correlation of the samples in the clusters with their respective metadata to define mathematically if the cluster assignment is driven more strongly by genotype or treatment.
Here are the two plots to demonstrate what I mean:
Do you have any suggestions on how I can compute if genotype or treatment drives the cluster assignment?
Thank you very much!