Question

Clustering of gene expression data with continuous clinical quantitative parameters of different range/units

7.2 years ago

svlachavas ▴ 790

Hello !!

I would like to use R (where I have done all of my analysis so far) to perform some kind of clustering (e.g. hierarchical clustering) on two groups of variables describing the same samples. One group is microarray gene expression data (for specific genes) that have been normalized and batch-effect corrected. The other group consists of quantitative clinical parameters for the same samples. However, these clinical variables have not been normalized or transformed in any way (i.e. they are raw continuous values).

For example, one of these variables ranges from ~0.002518 to ~27.3, whereas another ranges from 1.69 to 1.82, and another from 0.03 to 0.87.

My ultimate goal is to run hierarchical clustering on both groups simultaneously (merged into one matrix/data frame), in order to inspect which of these clinical variables cluster with specific genes. So:

1) Would row scaling (a z-score transformation: for each row variable, subtract the row mean and divide by the row standard deviation) be enough to handle all my continuous variables once merged, so that I can then perform the clustering? This is offered as an option in many R heatmap packages and functions.

2) Or does z-score standardization as described above require normally distributed data? If so, do I first have to transform my clinical variables separately (for example with a log2 transformation), then merge, row scale and cluster? My other concern here is that, given the ranges of the clinical parameters above, a log2 transform could produce a lot of negative values.

3) For a similar analysis, such as constructing a correlation plot of all of the above variables, would simple row scaling be sufficient?

Any suggestions or ideas would be beneficial !!

R microarray clustering clinical data • 3.7k views

updated 7.2 years ago by Jean-Karim Heriche 27k • written 7.2 years ago by svlachavas ▴ 790

score 0 · Answer 1 · 2017-01-30

Maybe not what you're looking for directly but I would take a different approach to combine the data. For clustering this kind of data, I would take a tensor factorization approach:

1) Compute two sample similarity matrices, one using the expression data and one using the clinical parameters.
2) Consider this as a tensor of order 3 with dimension n_samples x n_samples x 2 and apply a tensor factorization algorithm.
3) Interpret the result in terms of clustering.
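
To make the first two steps concrete, here is a rough sketch of how the tensor could be assembled. It is written in plain Python (standard library only) rather than R, and several things in it are my own assumptions, not part of the recipe above: the toy numbers are made up, each variable is z-scored across samples so that the different clinical ranges do not dominate, and similarity is taken as 1 / (1 + Euclidean distance). Any other sensible similarity measure would do. The factorization itself is left to whichever algorithm you choose.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each variable (column) to mean 0 and sd 1 across samples."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant variable: avoid dividing by 0
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def sample_similarity(rows):
    """n_samples x n_samples similarity, assumed here to be 1 / (1 + distance)."""
    scaled = zscore_columns(rows)
    n = len(scaled)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = sum((a - b) ** 2 for a, b in zip(scaled[i], scaled[j])) ** 0.5
            sim[i][j] = 1.0 / (1.0 + d)
    return sim

# Made-up toy data: 4 samples (rows), 3 genes and 2 clinical variables (columns).
expr = [
    [8.1, 5.2, 7.9],
    [8.3, 5.0, 8.1],
    [4.2, 9.1, 3.8],
    [4.0, 9.4, 4.1],
]
clin = [
    [0.0025, 1.69],
    [0.9, 1.70],
    [25.0, 1.81],
    [27.3, 1.82],
]

S_expr = sample_similarity(expr)
S_clin = sample_similarity(clin)

# Order-3 tensor of dimension n_samples x n_samples x 2:
# tensor[i][j] holds the (expression, clinical) similarities of samples i and j.
n = len(expr)
tensor = [[[S_expr[i][j], S_clin[i][j]] for j in range(n)] for i in range(n)]
```

In R the same construction would roughly be `scale()` on each block, `dist()` between samples, and the two resulting matrices stacked with `array()`.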

Have a look at this tutorial. To relate your situation to the tutorial, you can view the two sample similarity matrices as adjacency matrices of a graph with samples as nodes (the tutorial uses genes as nodes).

Another approach, maybe more conventional, is to combine the two similarity matrices and then apply a clustering algorithm to the resulting similarity matrix.
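
A minimal sketch of that second option, again in plain Python with made-up similarity values. The equal-weight average used to combine the matrices and the average-linkage merging rule are assumptions chosen for illustration; other combination rules (product, weighted sum) and other clustering algorithms fit the same pattern.

```python
def combine(sim_a, sim_b, weight=0.5):
    """Weighted average of two similarity matrices of the same size."""
    n = len(sim_a)
    return [[weight * sim_a[i][j] + (1 - weight) * sim_b[i][j]
             for j in range(n)] for i in range(n)]

def average_linkage(sim, k):
    """Agglomerative clustering: merge the two most similar clusters until k remain."""
    clusters = [[i] for i in range(len(sim))]

    def link(a, b):
        # Mean pairwise similarity between the members of two clusters.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        _, p, q = max(
            (link(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Made-up sample similarity matrices for 4 samples (symmetric, 1.0 on the diagonal).
S_expr = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
S_clin = [
    [1.0, 0.7, 0.1, 0.2],
    [0.7, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.9],
    [0.2, 0.1, 0.9, 1.0],
]

combined = combine(S_expr, S_clin)
clusters = average_linkage(combined, k=2)
print(clusters)  # [[0, 1], [2, 3]]
```

In R, the equivalent would be something like `hclust(as.dist(1 - S), method = "average")` on the combined matrix `S`, followed by `cutree()`.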