I calculated `GSVA` scores for a large set of tumors using a fairly small set of genes (~8 genes in the signature).

I found there to be a distinct bimodal distribution in my data (yes, `mx.diff = T`).

I then wanted to see if I could 'orthogonally' verify this by hierarchically clustering the tumors, using the expression of the same small gene signature as input (size-factor normalized using `DESeq2`, log2(x+1) transformed). I expected to see two distinct clusters that would correlate with my two `GSVA` modes. Hierarchical clustering with Euclidean distance on these data yielded the expected result: two rather disparate groups that correlated with the two modes of `GSVA`.

However, calculating a 1-Pearson distance matrix (after median centering the data) left me with a dendrogram that looks almost like a fractal: not a distinguishable cluster in sight.

My question is: why do I see this disparity, and which result should I regard as more realistic? Or have I gotten this whole thing turned around, and is neither reliable?

Histogram of enrichment scores, showing bimodality:

Hierarchical clustering using a Euclidean distance metric:

Hierarchical clustering using a 1 - Pearson correlation distance:

Let me know if you need anything else!

Thank you for your time - let me know if I need to clarify anything.

Can you share the results?

Are you clustering the GSVA results just based on the one enrichment score?

Regardless, I am not sure you necessarily expect to have two distinct clusters based on the expression values. The GSVA values have a bimodal distribution, but there are still many samples in the middle. The middle samples would not cleanly cluster with either group.

I'm clustering using the normalized, log transformed expression of the 8 or so genes that were used to generate the GSVA score - returning to the roots of the dataset, if you will.

I think you bring up a good point about not necessarily seeing two distinct clusters. I think what confused me the most is the presence of two clusters using the Euclidean distance metric but the lack of any using the 1 - Pearson distance metric. Are you able to speak to that at all?
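To make the comparison concrete, here is a toy sketch of one way the two metrics can disagree. It uses made-up numbers, not the actual tumor data or pipeline. The idea it illustrates is that Euclidean distance is sensitive to a sample's overall expression level across the signature, while 1 - Pearson removes each sample's own mean and scale and compares only the pattern across genes. All names and values (`shape`, `high`, `low`, `other_high`) are hypothetical, and whether this is what is happening in the real data is an assumption.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two sample profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation between two sample profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical gene-centered log2 expression for an 8-gene signature.
# `shape` is a small gene-to-gene pattern shared by two of the samples.
shape = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.1]

high = [v + 2.0 for v in shape]                  # all genes high
low = [v - 2.0 for v in shape]                   # all genes low, same pattern
other_high = [v + 2.0 for v in reversed(shape)]  # high level, different pattern

d_euc_high_low = euclidean(high, low)
d_cor_high_low = pearson_distance(high, low)
d_euc_high_other = euclidean(high, other_high)
d_cor_high_other = pearson_distance(high, other_high)

print("high vs low:        euclidean =", d_euc_high_low, " 1-pearson =", d_cor_high_low)
print("high vs other_high: euclidean =", d_euc_high_other, " 1-pearson =", d_cor_high_other)
```

In this toy case, Euclidean distance separates `high` from `low` and groups `high` with `other_high`, whereas 1 - Pearson treats `high` and `low` as nearly identical and pushes `high` and `other_high` apart. If the signature genes in the real data mostly move up and down together, the correlation-based distance would be driven largely by small gene-to-gene differences within each sample, which may be close to noise with only ~8 genes.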

Thank you for your time!