Question: How to find correlation between two datasets/vectors?
0
13 months ago by
John210
United States
John210 wrote:

Hi there,

I have two datasets. Matrix A of six columns and 300 rows. Matrix B of 12 columns and 300 rows. I want to find correlation between (each) one column of matrix A and all the columns of Matrix B.

Basically to show which column of Matrix B has similar expression pattern as Matrix A. Can anyone help me with R code?

Thanks.

rna-seq • 652 views
modified 13 months ago • written 13 months ago by John210
2
13 months ago by
Kevin Blighe59k
Kevin Blighe59k wrote:

To do the correlation, you can just follow this reproducible example using random data:

``````testmatrix1 <- matrix(rexp(1800, rate=.1), ncol=6)
colnames(testmatrix1) <- paste0("testmatrix1.Sample", 1:ncol(testmatrix1))

testmatrix2 <- matrix(rexp(3600, rate=.1), ncol=12)
colnames(testmatrix2) <- paste0("testmatrix2.Sample", 1:ncol(testmatrix2))

cor(testmatrix1, testmatrix2)

testmatrix2.Sample1 testmatrix2.Sample2 testmatrix2.Sample3
testmatrix1.Sample1          0.01296415       -0.0648876401        -0.026437428
testmatrix1.Sample2          0.02507722       -0.0466083100         0.051803640
testmatrix1.Sample3         -0.04509608        0.0885850839         0.016198746
testmatrix1.Sample4         -0.07253889        0.0109682660        -0.005428415
testmatrix1.Sample5          0.05309693        0.0005335641         0.025417901
testmatrix1.Sample6         -0.08103197       -0.0316691412         0.025865644
testmatrix2.Sample4 testmatrix2.Sample5 testmatrix2.Sample6
testmatrix1.Sample1        -0.044628657         -0.05145097        -0.084299004
testmatrix1.Sample2         0.091026535         -0.06536601         0.066175930
testmatrix1.Sample3         0.042544779         -0.02039660         0.111458092
testmatrix1.Sample4         0.040775626          0.04140741        -0.003768371
testmatrix1.Sample5        -0.026534599          0.05096871        -0.073291082
testmatrix1.Sample6        -0.008812823          0.08488437        -0.057106243
testmatrix2.Sample7 testmatrix2.Sample8 testmatrix2.Sample9
testmatrix1.Sample1        -0.053686505        -0.005479353        -0.024188906
testmatrix1.Sample2         0.151294463         0.030159951        -0.053079907
testmatrix1.Sample3         0.064047992        -0.045964736        -0.090923084
testmatrix1.Sample4        -0.149304622        -0.019311322        -0.063244353
testmatrix1.Sample5        -0.007202293        -0.039786009        -0.006401899
testmatrix1.Sample6        -0.082244307        -0.090395454        -0.027735417
testmatrix2.Sample10 testmatrix2.Sample11
testmatrix1.Sample1          -0.01897578          0.023409611
testmatrix1.Sample2           0.08483810          0.108088263
testmatrix1.Sample3          -0.01810386         -0.004953135
testmatrix1.Sample4           0.03711702          0.029325742
testmatrix1.Sample5           0.10329574         -0.026606608
testmatrix1.Sample6           0.02143732          0.044028624
testmatrix2.Sample12
testmatrix1.Sample1          -0.01421724
testmatrix1.Sample2          -0.03972874
testmatrix1.Sample3          -0.01106090
testmatrix1.Sample4           0.16243897
testmatrix1.Sample5          -0.04172540
testmatrix1.Sample6           0.04639565
``````

By default, `cor()` uses Pearson correlation. To use Kendall or Spearman, take a look at the man page.

This does not specifically address the sentiment of "similar expression pattern", though.

Kevin

thanks, how can I show whether it has "similar expression pattern"

Here is the limitation of correlation:

``````x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(10,11,15,17,26,77,190,200,267,300)

cor(x, y)
[1] 0.9337854

cor(x/1000, y*1000)
[1] 0.9337854

cor(x+50, y*1000000000000000000)
[1] 0.9337854
``````

It does not take into account the magnitude of the underlying values.

To actually check expression patterns, I would merge the datasets and then do, for example:

• PCA
• hierarchical clustering (with heatmap) on the Z-transformed merged dataset (in R, just use `scale(x)` for col scaling or `t(scale(t(x)))` row scaling)

thanks for enlightening me!

but the problem is they are from different platform, one is single cell ,another is ribotag seq.

here's my explanation

1

So, you are aiming to do some sort of multi-omics analysis? In that case, the correlation may be one of the easiest options. You can also use the cor.test function in place of `cor()` in order to derive a p-value for the correlation.

Another option may be to perform a regression between the datasets based on the genes.

okay, Thanks a-lot!

How can I make sure whether the cor.test function exactly find correlation between two matrix with same order of genes? (as they are two different data-frames, I'm afraid if the order is wrong!)

my code is below, x is one matrix, y is another.

``````for (j in 1:length(x)) {
for (i in 1:length(y)) { a <- cor.test(x[,j], y[,i], method = "spearman")
output[i,j]<- paste(a\$estimate)
}
}
``````

Hey, to create the matrix of p-values for the correlation test, it is indeed a bit more difficult and requires a loop of some sort. To build on the code that you have written, you just need to do:

## create the random datasets again

``````testmatrix1 <- matrix(rexp(1800, rate=.1), ncol=6)
colnames(testmatrix1) <- paste0("testmatrix1.Sample", 1:ncol(testmatrix1))
testmatrix2 <- matrix(rexp(3600, rate=.1), ncol=12)
colnames(testmatrix2) <- paste0("testmatrix2.Sample", 1:ncol(testmatrix2))
``````

## create empty data-frame with rownames as the colnames of testmatrix1

``````output <- data.frame(row.names = colnames(testmatrix1))
``````

## double loop to calculate p-values

``````for (j in 1:ncol(testmatrix1)) {
for (i in 1:ncol(testmatrix2)) {
a <- cor.test(testmatrix1[,j], testmatrix2[,i], method = "spearman")
output[j,i] <- a\$p.value
colnames(output)[i] <- colnames(testmatrix2)[i]
}
}

output
testmatrix2.Sample1 testmatrix2.Sample2 testmatrix2.Sample3
testmatrix1.Sample1           0.3518863          0.47606368          0.10102978
testmatrix1.Sample2           0.3961174          0.56423988          0.99568219
testmatrix1.Sample3           0.7408607          0.39592930          0.29648679
testmatrix1.Sample4           0.9291928          0.98327153          0.18853045
testmatrix1.Sample5           0.6518009          0.07268147          0.08976792
testmatrix1.Sample6           0.8339552          0.09669928          0.35602726
testmatrix2.Sample4 testmatrix2.Sample5 testmatrix2.Sample6
testmatrix1.Sample1           0.9092094           0.3305275          0.82292450
testmatrix1.Sample2           0.8855989           0.6614349          0.37536888
testmatrix1.Sample3           0.1092753           0.9206014          0.56385096
testmatrix1.Sample4           0.5927385           0.1078176          0.08340433
testmatrix1.Sample5           0.4327882           0.6604728          0.28091846
testmatrix1.Sample6           0.7108822           0.3686982          0.89042879
testmatrix2.Sample7 testmatrix2.Sample8 testmatrix2.Sample9
testmatrix1.Sample1          0.39665635           0.6124580           0.9679599
testmatrix1.Sample2          0.19092646           0.5742919           0.2318132
testmatrix1.Sample3          0.03735899           0.2470271           0.9994461
testmatrix1.Sample4          0.16147931           0.6147381           0.3380712
testmatrix1.Sample5          0.07368838           0.3759070           0.2812890
testmatrix1.Sample6          0.08264656           0.5687919           0.2017268
testmatrix2.Sample10 testmatrix2.Sample11
testmatrix1.Sample1           0.25252290            0.6361464
testmatrix1.Sample2           0.09881259            0.1207577
testmatrix1.Sample3           0.85420424            0.6877239
testmatrix1.Sample4           0.14445696            0.5322093
testmatrix1.Sample5           0.35637973            0.5900293
testmatrix1.Sample6           0.42454106            0.5896109
testmatrix2.Sample12
testmatrix1.Sample1            0.4367764
testmatrix1.Sample2            0.5291499
testmatrix1.Sample3            0.7566900
testmatrix1.Sample4            0.3309439
testmatrix1.Sample5            0.2748988
testmatrix1.Sample6            0.6713377
``````

## Now, check some values to ensure that it is correct:

``````cor.test(
testmatrix1[,"testmatrix1.Sample4"],
testmatrix2[,"testmatrix2.Sample7"],
method = "spearman")\$p.value
[1] 0.1614793

cor.test(
testmatrix1[,"testmatrix1.Sample6"],
testmatrix2[,"testmatrix2.Sample3"],
method = "spearman")\$p.value
[1] 0.3560273

cor.test(
testmatrix1[,"testmatrix1.Sample1"],
testmatrix2[,"testmatrix2.Sample11"],
method = "spearman")\$p.value
[1] 0.6361464
``````

Note that, for a larger data-matrix, we could compute-parallelise this with the `foreach` function and the `%dopar%` operator

Thanks for the beautiful explanation, but my doubt is.. the row names(genes) of those two matrix is not in the same order.

My intention is to find whether the gene expression pattern is same or not. I used this, I'm not sure it whether it will stay in same order while using cor.test()

``````x<- x[order(rownames(x)),]; y<- y[order(rownames(y)),]
``````
1

Both matrices have 300 genes, right? You want to ensure that the order of the genes is the same?

After you use the `order()` function, a simple check like this can be performed:

``````table(rownames(x) == rownames(y))
``````

If you see just `TRUE`, then the gene names are the same and in the same order.

If you see a `FALSE`, then there is a discrepancy. You can use `which()` and `match()` to harmonise them.

That's TRUE, thanks....

1. I'm not sure why I'm getting Heatmap like this. This code is not working, only for this data.

z<- output; zz<- melt(as.matrix(z)) ggplot(data = zz, aes(x=Var2, y=Var1, fill=value)) + geom_tile()

heatmap

1. And also If I plot the matrix and transpose of matrix i.e t(x), it should give the same kind of heat map right? it's giving different one!!

here is correlation matrix

``````> head(output)
1                  2                  3                  4
la -0.237894537953645 -0.251502630862629 -0.264305063132588 -0.245031084597787
lb -0.215517768370364 -0.229088729495185 -0.242997143937622 -0.224441452675199
lc -0.231200458977979 -0.246982412009137 -0.261266470629486 -0.243472483950217
ld -0.242659217031868 -0.258090770669682 -0.272427581120237 -0.253981579777381
le -0.246959115443146 -0.262707629126362  -0.27741638335168  -0.26039059644323
lf -0.253397678701017 -0.269383847422869 -0.285295861526805  -0.26807596290528
``````

The output that you get (in the figure) tells me that your numerical data is encoded as categorical, for whatever reason. Please check each step and use the function `str()` to see where this automatic conversion is taking place. Sometimes it may happen if there is a 'rogue' space or comma (or other non-numeric character) embedded in your numerical data somewhere.

Looking at your code, I suspect that `as.matrix(z)` is not doing what you think. Try `data.matrix()`, or just have `melt(x)`