Question: Compute the correlations between submatrices
pablo140 wrote:

Hi guys,

Related to my previous post: How to get a half correlation matrix?

I need to compute the correlations between the columns of my matrix, which has more than 800,000 rows, in R. I have decided to split this matrix into submatrices (with 60,000 rows each) and to compute the pairwise correlations between these submatrices.

I use SLURM. What I want to do is to distribute the computation of the correlation between two submatrices on a node of the cluster I use, in order to parallelize.

For the moment, I have created an argument which takes into account the number of columns of the main matrix I want to compute.

For example, with the command line `data=data[1:opt$subset,]` (in R), I can compute the correlations from my 1st column to the 10,000th. For that, I have set up arrays in my SLURM script: `subset=$((SLURM_ARRAY_TASK_ID*10000))`. I have defined 10 array tasks, so the first one will compute from the 1st to the `1*10000` = 10,000th column, the second from the 1st to the `2*10000` = 20,000th column, and so on.

With the argument `data=data[(as.numeric(opt$subset)-4999):opt$subset,]`, I can compute the correlations within a block/submatrix with a defined number of columns. For example, if I want to create blocks of 5,000 columns, I set up my argument as above and, in SLURM, my arrays as `subset=$((SLURM_ARRAY_TASK_ID*5000))`. So my first block runs from the `(1*5000)-4999` = 1st column to the `1*5000` = 5,000th column, the second block from the `(2*5000)-4999` = 5,001st column to the `2*5000` = 10,000th column, and so on.
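The block arithmetic above can be checked with a small R sketch (block size 5,000 is the value from my SLURM setup):

```r
# Map a SLURM array task id to a block's column range (block size 5000,
# as described above). Task 1 -> columns 1..5000, task 2 -> 5001..10000, ...
block_range <- function(task_id, block_size = 5000) {
  last  <- task_id * block_size
  first <- last - (block_size - 1)
  c(first, last)
}

block_range(1)  # 1 5000
block_range(2)  # 5001 10000
```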

My problem is this: the correlations are computed within these blocks independently. What I want to do is compute the correlations between all of these blocks (i.e. pairwise correlations between all the blocks), like this:

``````     [,1] [,2]
[1,]    1    1
[2,]    1    2
[3,]    1    3
[4,]    1    4
[5,]    1    5
[6,]    1    6
...
``````

up to block 6 vs block 6.
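The enumeration above can be generated in R; this is a sketch assuming 6 blocks, as in the listing:

```r
# Enumerate every unordered pair of blocks (1,1), (1,2), ..., (6,6) so that
# each SLURM array task can pick one row by its task id.
n_blocks <- 6
pairs <- expand.grid(b1 = 1:n_blocks, b2 = 1:n_blocks)
pairs <- pairs[pairs$b1 <= pairs$b2, ]        # keep each pair once
pairs <- pairs[order(pairs$b1, pairs$b2), ]   # (1,1), (1,2), ... order

nrow(pairs)   # 21 pairs for 6 blocks: choose(6, 2) + 6
# Task id k (1-based) then works on pairs[k, ]
```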

Cheers

R CODE (Kevin Blighe gave me the foreach function)

``````#load packages
library(compositions)
library(parallel)
library(doParallel)
library(optparse)

args <- commandArgs(trailingOnly = F)

# get options

option_list = list(
make_option(c("-s", "--subset"), type="character", default=NULL, help="Input file matrix ")
);

opt_parser = OptionParser(usage = "Usage: %prog -f [FILE]", option_list=option_list, description= "Description:")

opt = parse_args(opt_parser)

#main code

print('Set Up Cores')

cores <- 32
options('mc.cores'=cores)
registerDoParallel(cores)

# (data is read in beforehand with read.table; see the full script below)
data = t(data)

##THIS IS MY ARGUMENT###

#data = data[(as.numeric(opt$subset)-4999):as.numeric(opt$subset), ]
data = data[1:as.numeric(opt$subset), ]

res <- foreach(i = seq_len(ncol(data)),
               .combine = rbind,
               .multicombine = TRUE,
               .inorder = FALSE,
               .packages = c('data.table', 'doParallel')) %dopar% {
    if ((i %% 1000) == 0) {
        print(i)
    }
    # rho proportionality of column i against every column
    apply(data, 2, function(x) 1 - (var(data[,i] - x) / (var(data[,i]) + var(x))))
}
``````

SLURM CODE

``````#!/bin/bash
#SBATCH --nodes=1
#SBATCH -o slurmjob-%A-%a.out
#SBATCH --job-name=rho_blocks_5k
#SBATCH --mail-user=vincentpailler@hotmail.fr
#SBATCH --partition=normal
#SBATCH --time=1-00:00:00
#SBATCH --mem=250G
#SBATCH --array=1-10

#Set up whatever package we need to run with

# SET UP DIRECTORIES

OUTPUT="$HOME"/PROJET_M2/bin/propr/$(date +"%Y%m%d")_parallel_blocks_32cpus_5000
mkdir -p "$OUTPUT"

export FILENAME=/home/vipailler/PROJET_M2/bin/coefficient_rho.R

# map the array task id to the subset argument passed to R
subset=$((SLURM_ARRAY_TASK_ID*5000))

#Run the program

echo "Start job :"`date` >> "$OUTPUT"/temp_"$SLURM_ARRAY_TASK_ID".txt
echo "Start job :"`date`

Rscript $FILENAME --subset $subset > "$OUTPUT"/"$SLURM_ARRAY_TASK_ID"

echo "Stop job : "`date` >> "$OUTPUT"/temp_"$SLURM_ARRAY_TASK_ID".txt
echo "Stop job : "`date`
``````

The output I get looks like this:

``````OTU0001     OTU0004    OTU0014    OTU0016    OTU0017      OTU0027
OTU0001  1.00000000  0.96688301 0.80621218 0.16754758 0.40818524  0.155976198
OTU0004  0.96688301  1.00000000 0.81330915 0.18928670 0.43247749  0.187540302
OTU0014  0.80621218  0.81330915 1.00000000 0.23753965 0.57237416  0.222890740
OTU0016  0.16754758  0.18928670 0.23753965 1.00000000 0.64007329  0.775772234
OTU0017  0.40818524  0.43247749 0.57237416 0.64007329 1.00000000  0.445145905
OTU0027  0.15597620  0.18754030 0.22289074 0.77577223 0.44514590  1.000000000
...
``````

Then I rearrange my output with:

``````Df<-data.frame(var1=rownames(res)[row(res)[upper.tri(res)]],
var2=colnames(res)[col(res)[upper.tri(res)]],
corr=res[upper.tri(res)])
``````

in order to get :

``````    var1    var2          corr
1   OTU0001 OTU0004  0.9668830120
2   OTU0001 OTU0014  0.8062121821
3   OTU0004 OTU0014  0.8133091522
4   OTU0001 OTU0016  0.1675475819
5   OTU0004 OTU0016  0.1892866996
6   OTU0014 OTU0016  0.2375396470
7   OTU0001 OTU0017  0.4081852433
8   OTU0004 OTU0017  0.4324774863
...
``````
R matrix correlation foreach
modified 19 months ago by RamRS30k • written 19 months ago by pablo140

I can only speak for myself, but you should really consider making your question shorter and more on-point. I would not be motivated to read through a question that long to understand what the issue is.

OK, sorry. I was wondering whether it was too long as well.

I have a big matrix with more than 800,000 columns. I need to compute the correlations between these columns. There are too many to compute all at once, so I have decided to split this big matrix into smaller submatrices (60,000 columns each).

So, I need to compute the correlations of these submatrices against each other, but I am stuck. The R function I have can only compute the correlations on a single submatrix. I need to edit my R code but I don't really know how to do it.

I have created a small subset data (100 columns) to work on.

``````ncol <- ncol(data)   # 100 columns in the test subset

rest   <- ncol %% 10
blocks <- ncol %/% 10

# creation of the blocks (10 blocks of 10 columns each here)

ngroup <- rep(1:blocks, each = ncol %/% blocks)
split  <- split(1:ncol, ngroup)

# creation of the pairwise combinations between each group

combs <- expand.grid(1:length(split), 1:length(split))
combs <- t(apply(combs, 1, sort))
combs <- unique(combs)

#      [,1] [,2]
# [1,]    1    1
# [2,]    1    2
# [3,]    1    3
# [4,]    1    4
# [5,]    1    5
# [6,]    1    6
# ....
``````

I don't know whether this is a good start towards a solution.

You will get more joy on Stack Overflow with a strictly R-programming question. Plus, I thought you had 800k rows, not 800k columns. Is your input a square matrix? If not, precisely what are its dimensions?

I asked about my problem on Stack Overflow but I haven't got any answer yet.

Actually, my starting matrix has 800,000 rows (OTUs) and 40 columns (samples). But I used `data=t(data)` to work on the columns.

OK. I still don't know the answer to the thing that is puzzling me, so I'll ask it a different way. What is the expected dimension of the correlation matrix: (800k * 800k) or (40 * 40)?

The expected dimension of the correlation matrix is 800k*800k. That's why I want to split my starting matrix into submatrices

How are you going to store the data?

`M <- matrix(rep(0, 64 * 10^10), nrow=8*10^5)`

`Error: cannot allocate vector of size 4768.4 Gb`
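The size in that error message checks out; a quick sketch of the arithmetic:

```r
# Why the full matrix cannot be allocated: 8e5 x 8e5 doubles at 8 bytes each.
n <- 8e5
bytes <- n * n * 8
bytes / 1024^3   # ~4768.4 GiB, matching the error message
```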

Actually, I work on a cluster with several nodes. My goal here is to split my main matrix into several submatrices (with 60 000 columns for each for example). Then, I want to compute the correlations between each submatrices by distributing the calculation on nodes (thanks to the parallelization) .

According to this distribution :

``````         [,1] [,2]
[1,]    1    1     #node1
[2,]    1    2     #node 2
[3,]    1    3     ....
[4,]    1    4
[5,]    1    5
[6,]    1    6
``````

I will arrange each output in order to get a file like this :

``````      var1    var2       corr
1   OTU0001 OTU0004  0.9668830120
2   OTU0001 OTU0014  0.8062121821
3   OTU0004 OTU0014  0.8133091522
4   OTU0001 OTU0016  0.1675475819
5   OTU0004 OTU0016  0.1892866996
``````

And to finish, I will gather together each one of these outputs in order to get only one file which will contain each unique pairwise correlation between the OTUs.
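The final gathering step could look something like this sketch (the `results` directory and file-name pattern are hypothetical placeholders for wherever the per-pair outputs land):

```r
# Read every per-pair output file and stack them into one table of pairs.
files <- list.files("results", pattern = "^block_.*\\.tsv$", full.names = TRUE)
all_pairs <- do.call(rbind, lapply(files, read.table,
                                   header = TRUE, sep = "\t",
                                   stringsAsFactors = FALSE))
# drop accidental duplicates (e.g. diagonal block pairs appearing twice)
all_pairs <- unique(all_pairs)
```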

There seems to be a lot of overengineering going on here. Can we just fix your R code. First you need a function that can run correlation of one set of your OTUs against all of the rest of the OTUs and return a (var1, var2, corr) data.frame where var1 < var2. Use `cor(A[, my_indices], A)` and `as.data.frame.table` and `dplyr::rename/filter`. Don't roll your own correlation function.
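A sketch of that suggestion, with `A` and `my_indices` as placeholders for your matrix and one block's column indices:

```r
# Correlate one block of columns against the whole matrix with cor(),
# then tidy the result into (var1, var2, corr) rows where var1 < var2.
library(dplyr)

block_vs_all <- function(A, my_indices) {
  res <- cor(A[, my_indices, drop = FALSE], A)   # block x all matrix
  df  <- as.data.frame.table(res, responseName = "corr")
  df %>%
    rename(var1 = Var1, var2 = Var2) %>%
    filter(as.character(var1) < as.character(var2))  # keep each pair once
}
```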

Do you think it is really tricky to edit my own correlation function? It works really well for the number of columns I have chosen. If I edit it to use a nested loop, could it work?

``````res <- foreach(i = seq_len(ncol(data)),
.combine = rbind,
.multicombine = TRUE,
.inorder = FALSE,
.packages = c('data.table', 'doParallel')) %dopar% {
apply(data, 2, function(x) 1 - ((var(data[,i] - x)) / (var(data[,i]) + var(x))))
}
``````

But why are you using a vector-by-vector implementation of `cor` that's slow and untested, when there is a perfectly good matrix by matrix version?

I found this coefficient in a publication "propr: An R-package for Identifying Proportionally Abundant Features Using Compositional Data Analysis" . I thought it was possible to edit my apply function in order to get what I want.

And I use it to practice with R, because I am pretty new to this language.

Aaaahhhh, you're not doing correlation. I thought it looked strange.

Yes, sorry for the misunderstanding. I said "correlation" loosely; it is really a proportionality metric.

But is my problem clearer now?

Not to me. I don't understand why you've written all this cluster code before you've tested your R code works. Where do you specify the matrix that you are doing your proportionality test against?

I tested my R code on a small subset of my matrix (10 rows and 40 columns), and it works. I got a 40*40 matrix with the correct results. So the code below works:

``````data <- read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rares.tsv", sep="\t", h=T, row.names=1)+1

data = t(data)
data = data[, 1:40]
data = data[1:10, ]

cores <- 32
options('mc.cores' = cores)
registerDoParallel(cores)

res <- foreach(i = seq_len(ncol(data)),
               .combine = rbind,
               .multicombine = TRUE,
               .inorder = FALSE,
               .packages = c('data.table', 'doParallel')) %dopar% {
    apply(data, 2, function(x) 1 - (var(data[,i] - x) / (var(data[,i]) + var(x))))
}
``````

Yes but your problem is that you aren't getting the inter-block proportionality results isn't it? How does this test that?

Yes, that's right. I just want to compute the proportionality between each pair of blocks (block 1 vs block 1, block 1 vs block 2, ...) and gather all the outputs. The problem is that my code computes the proportionality only on the first N columns of my data file (up to some number of columns, according to my computational power). So you can consider that I execute my code on only one block (which is the data file itself).

And I really don't know how I could modify my code to get the inter-block proportionality results. I guess I can split my data file into many submatrices, but that's as far as I've got for the moment.

No it isn't. There's more than one way to split your `data` up. You can use one set of column indices to pull out block_a and another to pull out block_b. Then iterate over the cols of block a and over the cols of block b. Whatever you do, write a function to do it and test that function.
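One way to follow that advice, reusing the rho statistic from the post and keeping it as a plain function that can be tested on a small matrix first (`cols_a` and `cols_b` are the two sets of column indices):

```r
# Compute rho proportionality between the columns of block a and block b only:
# rho(x, y) = 1 - var(x - y) / (var(x) + var(y))
rho_between_blocks <- function(data, cols_a, cols_b) {
  sapply(cols_b, function(j) {
    sapply(cols_a, function(i) {
      1 - var(data[, i] - data[, j]) / (var(data[, i]) + var(data[, j]))
    })
  })
}
```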

I found something: I created two small matrices (which could correspond to two blocks), matrix1 and matrix2. Here it is a test of correlation between the rows of the two matrices (no matter whether I work on the rows or the columns, my goal is really to compute the correlation/proportionality between my blocks). I can compute the pairwise correlations for matrix1 vs matrix1, matrix1 vs matrix2, and matrix2 vs matrix2:

``````cor_blocks <- function(a, b) {
  M    <- rbind(a, b)
  M    <- M - rowMeans(M)               # centre each row
  M    <- M / sqrt(rowSums(M^2))        # scale each row to unit length
  CR   <- tcrossprod(M)                 # full row-by-row correlation matrix
  edge <- nrow(a)
  CR   <- t(CR[1:edge, -c(1:edge)])     # keep only the a-vs-b corner
  return(CR)
}
``````
``````
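Assuming the small typos in the function above are fixed (`cr` should be `CR`, and `edge` should be `nrow(a)` rather than `dim(a)`), it can be sanity-checked against the built-in `cor()` on random data:

```r
# For row-wise correlation, cor_blocks(a, b) should match cor() on transposes.
set.seed(1)
a <- matrix(rnorm(40), nrow = 4)   # 4 "OTUs" x 10 samples
b <- matrix(rnorm(50), nrow = 5)   # 5 "OTUs" x 10 samples

res <- cor_blocks(a, b)
dim(res)                           # 5 x 4: rows of b vs rows of a
all.equal(res, cor(t(b), t(a)))    # TRUE
```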