Question: How Can I Create Similarity Matrix From Gene Expression And Rnaseq Data?
5.3 years ago by
samsara570
The Earth
samsara570 wrote:

I have gene expression data (log2, lowess normalized) for different samples, as follows:

``````
genes    sample1    sample2    sample3    sample4
g1    -0.25    -0.91275    -0.641    0.37
g2    1.3245    -2.126    7.495    -1.151
g3    0.31775    0.731    -1.151    0.182
...
``````

I have RNA-seq data (gene quantification) with RPKM values:

``````
genes    sample1    sample2    sample3    sample4
g1    0.834890179247665    11.2357774823452    6.39239374912979    0.504388295468584
g2    0.1332993726877    1.09436685773882    0.00311332856051572    3.82123310407725
g3    0.764239475051307    0.609334844909887    0.107913669064748    2.71814202633155
...
``````

How can I create a similarity matrix (among samples) based on the data above?

rnaseq • 4.1k views
modified 2.1 years ago by Biostar ♦♦ 20 • written 5.3 years ago by samsara570

If you want to use correlation as the similarity measure, this can easily be done in R. Otherwise, you have to define what 'similarity' means to you. For example, you can calculate a squared norm between the two values representing the same gene in different samples, i.e. ||x_ij - x_ik||², where x_ij is the expression of gene i in sample j. This gives very small values for genes with similar expression in the two samples and large values for genes whose expression differs. Summing over all genes i gives the squared Euclidean distance between samples j and k, which is a dissimilarity rather than a similarity.
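A minimal sketch of both options in R, using the three genes shown in the question for the log2 expression data. The object names (`expr`, `sim_cor`, `d2`) are chosen here for illustration, and three genes are far too few for a meaningful result; they only show the mechanics.

```r
# Log2 expression values copied from the question
# (genes in rows, samples in columns)
expr <- matrix(
  c(-0.25,    -0.91275, -0.641,  0.37,
     1.3245,  -2.126,    7.495, -1.151,
     0.31775,  0.731,   -1.151,  0.182),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3"), paste0("sample", 1:4))
)

# Option 1: correlation as similarity.
# cor() compares columns, so this is sample vs. sample.
sim_cor <- cor(expr)

# Option 2: squared Euclidean distance between samples.
# dist() compares rows, hence the transpose.
d2 <- as.matrix(dist(t(expr)))^2
```

Both results are symmetric 4 x 4 matrices. Note that `sim_cor` is a similarity (larger means more alike), while `d2` is a distance (smaller means more alike).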

When you say "among samples", are you referring to both of the data matrices that you have? What I mean is, do you want to

a) calculate one similarity matrix for the samples from your "gene expression (log2 lowess normalized)" data and a separate one for the "rnaSeq (gene quantification data)", or

b) calculate the similarity among samples both within and across the two datasets?

In both cases you can compute a correlation coefficient (for example, Pearson's correlation coefficient) for any two columns. You may also be interested in the p-value associated with the coefficient, which tells you how statistically significant that value is.
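For a single pair of samples, `cor.test()` from base R returns both the coefficient and its p-value. The sketch below uses simulated data (`expr` is a hypothetical matrix of 100 genes by 4 samples), since a p-value computed from only a handful of genes is not informative.

```r
set.seed(1)

# Hypothetical data: 100 genes (rows) x 4 samples (columns)
expr <- matrix(
  rnorm(400), nrow = 100,
  dimnames = list(paste0("g", 1:100), paste0("sample", 1:4))
)

# Pearson correlation between two samples, with its p-value
res <- cor.test(expr[, "sample1"], expr[, "sample2"], method = "pearson")

res$estimate  # correlation coefficient
res$p.value   # p-value for the null hypothesis of zero correlation
```

If you test every pair of samples, it is probably worth adjusting the p-values for multiple testing, for example with `p.adjust()`.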

5.3 years ago by
User 1933340
User 1933340 wrote:

You need to decide how to measure similarity. There are different methods, such as correlation, information-theoretic measures, and other kernel methods. For example, in R:

Let's say you have a dataset like this (rows as genes, columns as samples):

``````
dt <- replicate(3, rnorm(4))  # 4 x 3 matrix of random values
``````

There are a number of kernel functions in the kernlab library.

Say we want to measure the similarity using the RBF kernel:

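``````
library(kernlab)  # provides rbfdot() and kernelMatrix()
rbf <- rbfdot(sigma = 0.05)
kernelMatrix(rbf, t(dt))  # kernelMatrix() compares rows, so t() puts the samples in rows
``````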

Or, to use correlation, you can simply call

``````
cor(dt)  # pairwise Pearson correlation between columns
``````

You can build one combined correlation matrix in different ways. Either merge the two datasets in the first step (they appear to be on different scales, however, so you might need to replace the values with ranks or normalize them first), or compute a correlation matrix for each dataset and then combine the two, for example using a dot product.
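One way to read the rank suggestion, sketched with the values from the question (RPKM values rounded to four decimals; the object names `expr`, `rpkm`, `merged` and `sim_all` and the `array_`/`rnaseq_` column prefixes are illustrative). Spearman correlation works on ranks, so the different scales of the two platforms matter less. This assumes both matrices describe the same genes, matched here by row name.

```r
genes   <- c("g1", "g2", "g3")
samples <- paste0("sample", 1:4)

# Log2 expression values from the question
expr <- matrix(
  c(-0.25,    -0.91275, -0.641,  0.37,
     1.3245,  -2.126,    7.495, -1.151,
     0.31775,  0.731,   -1.151,  0.182),
  nrow = 3, byrow = TRUE,
  dimnames = list(genes, paste0("array_", samples))
)

# RPKM values from the question, rounded to four decimals
rpkm <- matrix(
  c(0.8349, 11.2358, 6.3924, 0.5044,
    0.1333,  1.0944, 0.0031, 3.8212,
    0.7642,  0.6093, 0.1079, 2.7181),
  nrow = 3, byrow = TRUE,
  dimnames = list(genes, paste0("rnaseq_", samples))
)

# Merge first: bind the samples of both datasets column-wise,
# aligning the genes by row name
merged <- cbind(expr, rpkm[rownames(expr), ])

# Spearman correlation ranks the values within each column,
# giving an 8 x 8 matrix of similarities within and across datasets
sim_all <- cor(merged, method = "spearman")
```

With real data you would use all shared genes rather than three; the off-diagonal blocks of `sim_all` then hold the cross-platform similarities.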

If you explain your further analysis, i.e. what you want to do after building these similarity matrices, I might be able to make more concrete suggestions.