Finding Spearman's rho correlation matrix
2.2 years ago
Natasha ▴ 40

Hi All,

This is a follow-up to my previous post here

I intend to cluster tissues based on gene expression levels. I am trying to replicate figure 1 of this paper

Based on the input I received on my previous post, I converted the data to the format below, using categorical gene expression levels for more than 1000 genes. For illustration, only two columns (Ensembl gene IDs) are shown.

                  ENSG00000000003 ENSG00000000419 ....
appendix                 2.000000        3.500000 ...
bone marrow              1.000000        3.000000 ...
breast                   2.000000        3.000000 ...
bronchus                 4.000000        3.000000 ...
caudate                  1.000000        2.500000 ...


From the above data, I'd like to compute the Spearman's rho correlation matrix and convert it to a distance measure for clustering.

Could someone explain how Spearman's rho is computed? (I have looked at the built-in R functions suggested in my previous post, but I would like to understand the calculation itself.)
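For reference, here is a minimal sketch of the correlation-to-distance step described above. It assumes the four-gene sample values posted later in this thread (tissues in rows, genes in columns, as in the table above), uses `1 - rho` as the distance, and picks average linkage for `hclust`. The choice of distance and linkage is an assumption for illustration, not something taken from the paper.

```r
# Tissues in rows, genes in columns. Values are the four-gene sample
# posted later in this thread; they are for illustration only.
expr <- rbind(
  "adrenal gland" = c(1, 4,   1,        3),
  "appendix"      = c(2, 3.5, 1.5,      1.5),
  "bone marrow"   = c(1, 3,   2,        2),
  "breast"        = c(2, 3,   2.666667, 3),
  "bronchus"      = c(4, 3,   1,        3)
)
colnames(expr) <- c("ENSG00000000003", "ENSG00000000419",
                    "ENSG00000000457", "ENSG00000000460")

# cor() correlates columns, so transpose to get a tissue x tissue matrix.
rho <- cor(t(expr), method = "spearman")

# Assumed distance: 1 - rho, which ranges from 0 (same ranking) to 2 (reversed).
d <- as.dist(1 - rho)

# Hierarchical clustering; average linkage is an illustrative choice.
hc <- hclust(d, method = "average")
```

Calling `plot(hc)` on the result draws the dendrogram of tissues.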

gene-expression tissue correlation spearman • 496 views
2.2 years ago

There's a great explanation of how it is calculated here: https://en.m.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient


Many thanks for the link. I read through the explanation. I'd like to ask for clarification on how to interpret the computation of the correlation matrix.

The following is the sample data that is considered

df

                adrenal gland appendix bone marrow   breast bronchus
ENSG00000000003             1      2.0           1 2.000000        4
ENSG00000000419             4      3.5           3 3.000000        3
ENSG00000000457             1      1.5           2 2.666667        1
ENSG00000000460             3      1.5           2 3.000000        3


Using corr <- cor(df,method = "spearman")

the following output is obtained

              adrenal gland   appendix bone marrow      breast   bronchus
adrenal gland     1.0000000 0.50000000   0.8333333  0.88888889  0.0000000
appendix          0.5000000 1.00000000   0.3333333  0.05555556  0.5000000
bone marrow       0.8333333 0.33333333   1.0000000  0.83333333 -0.5000000
breast            0.8888889 0.05555556   0.8333333  1.00000000 -0.3333333
bronchus          0.0000000 0.50000000  -0.5000000 -0.33333333  1.0000000


From what I understand, the above matrix is constructed as df^T * df (where df^T is the transpose of df), which gives a tissue x tissue correlation matrix with variances on the diagonal and covariances on the off-diagonal entries. Could you please explain how this matrix should be interpreted?
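One way to check this understanding: the diagonal of the output above is all 1s, which is what correlations look like rather than variances. The sketch below assumes the sample `df` shown above and compares three things: the raw cross-product `t(df) %*% df`, the built-in Spearman matrix, and a Pearson correlation computed on the per-tissue ranks. The object names (`raw_crossprod`, `rho_by_hand`, and so on) are made up for this example.

```r
# Sample data from this thread: genes in rows, tissues in columns.
df <- data.frame(
  "adrenal gland" = c(1, 4, 1, 3),
  "appendix"      = c(2, 3.5, 1.5, 1.5),
  "bone marrow"   = c(1, 3, 2, 2),
  "breast"        = c(2, 3, 2.666667, 3),
  "bronchus"      = c(4, 3, 1, 3),
  check.names = FALSE,
  row.names = c("ENSG00000000003", "ENSG00000000419",
                "ENSG00000000457", "ENSG00000000460")
)

# Raw cross-product: sums of products of the original values.
# Its diagonal holds sums of squares, not 1s.
m <- as.matrix(df)
raw_crossprod <- t(m) %*% m

# Spearman by hand: rank the genes within each tissue (column),
# then take the ordinary Pearson correlation of those ranks.
ranks <- apply(df, 2, rank)
rho_by_hand <- cor(ranks, method = "pearson")

# Built-in version for comparison.
rho_builtin <- cor(df, method = "spearman")

# Largest absolute difference between the two Spearman matrices.
max_diff <- max(abs(rho_by_hand - rho_builtin))
```

Each entry of the Spearman matrix is read as follows: a value near 1 means the two tissues rank the genes in nearly the same order, a value near -1 means the orders are nearly reversed, and a value near 0 means no monotonic relationship.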


Also, the above-mentioned link gives a formula for the case where all the ranks are distinct. Could you please explain how to assign ranks when the values of a variable are not distinct (e.g. the data stored in df)?
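A minimal sketch of the usual convention, assuming average ("fractional") ranks: tied values each receive the mean of the rank positions they occupy, and this is the default behaviour of R's `rank()`. The example uses the adrenal gland and appendix columns from the sample data. With ties present, the shortcut formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)) no longer agrees with the Pearson correlation of the ranks, so the latter is the one to use.

```r
x <- c(1, 4, 1, 3)        # adrenal gland
y <- c(2, 3.5, 1.5, 1.5)  # appendix

# Tied values share the average of the positions they occupy.
rx <- rank(x, ties.method = "average")  # 1.5 4.0 1.5 3.0
ry <- rank(y, ties.method = "average")  # 3.0 4.0 1.5 1.5

# General definition: Pearson correlation of the ranks (valid with ties).
rho_ties <- cor(rx, ry)                 # 0.5, matching the matrix above

# Shortcut formula, exact only when all ranks are distinct.
n <- length(x)
rho_shortcut <- 1 - 6 * sum((rx - ry)^2) / (n * (n^2 - 1))  # 0.55
```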