Question: How to get a half correlation matrix?
29 days ago by
vincentpailler50 wrote:

Hi guys,

I would like to know whether it is possible to compute only half of a correlation matrix. I mean, I don't want to compute the whole matrix and then extract just the upper or the lower half; I want to get the half matrix directly.

I use R. I already have some code, but it computes the whole matrix:

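# read the table (first column = row names) and add 1 to every value
data <- read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rares.tsv", sep = "\t", header = TRUE, row.names = 1) + 1

# one result row per column i, computed in parallel
# note: with .inorder = FALSE the rows are combined as workers finish,
# so they may not come back in column order
res <- foreach(i = seq_len(ncol(data)),
 .combine = rbind,
 .multicombine = TRUE,
 .inorder = FALSE,
 .packages = c('data.table', 'doParallel')) %dopar% {
 apply(data, 2, function(x) 1 - ((var(data[,i] - x)) / (var(data[,i]) + var(x))))
}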

Thanks

correlation matrix R • 305 views
ADD COMMENTlink modified 22 days ago by zx87547.1k • written 29 days ago by vincentpailler50
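
A side note on the measure in the question: since var(a - b) = var(a) + var(b) - 2 cov(a, b), the quantity 1 - var(a - b) / (var(a) + var(b)) equals 2 cov(a, b) / (var(a) + var(b)), which is symmetric in a and b. So one triangle does carry all the information. A minimal sketch on toy data (the names `m`, `rho_pair` and `rho` are made up for illustration) that checks this and computes the matrix in one vectorized step from `cov()`:

```r
# Toy data standing in for the real table (assumption: numeric columns).
set.seed(1)
m <- matrix(rnorm(40 * 6), nrow = 40, ncol = 6)

# Measure from the question, computed for a single pair of columns.
rho_pair <- function(a, b) 1 - var(a - b) / (var(a) + var(b))

# Vectorized form: 2 * cov(i, j) / (var_i + var_j) for all pairs at once.
S <- cov(m)
v <- diag(S)
rho <- 2 * S / outer(v, v, "+")

# The vectorized value agrees with the pairwise formula,
# and the resulting matrix is symmetric.
all.equal(rho[2, 5], rho_pair(m[, 2], m[, 5]))
isSymmetric(rho)
```

This does not reduce memory use, since `cov()` still returns the full square matrix; it only avoids the explicit loop for data that fits in RAM.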

For the record, this code is mine, from my previous answer:

ADD REPLYlink written 29 days ago by Kevin Blighe41k

I read some documentation about the functions upper.tri and lower.tri, but I don't know whether they would be suitable for this code.

ADD REPLYlink written 29 days ago by vincentpailler50

Yes, I know what you are aiming to do. A data-frame/matrix is 'rectangular' in structure, though. If you just fill the upper part, you still have to have the bottom part, even if the cells are empty.

ADD REPLYlink written 29 days ago by Kevin Blighe41k
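
To illustrate the point about the rectangular structure, here is a small sketch on toy data (not the poster's file) using the base functions `upper.tri` and `lower.tri`. Blanking one triangle leaves the object at its full size, whereas extracting a triangle gives a plain vector:

```r
set.seed(1)
toy <- matrix(rnorm(20 * 4), nrow = 20)   # toy data: 20 samples, 4 variables
cm <- cor(toy)

# Blank the lower triangle: the object is still a full 4 x 4 matrix.
half <- cm
half[lower.tri(half)] <- NA
dim(half)

# Extract the strict upper triangle as a vector of n * (n - 1) / 2 values.
vals <- cm[upper.tri(cm)]
length(vals)
```

Both variants start from the full matrix, so they save storage afterwards but not computation time.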

My goal here is to save time. So, if I fill the upper or the lower part with 0 and compute the correlations only on the other part, will I save time?

ADD REPLYlink written 29 days ago by vincentpailler50

You may just have to be patient. You could add a line such as this to your foreach loop:

if ((i %% 100) == 0) {
  print(i)
}

Then it will print the value of i each time another 100 values have been processed. You may see, for example, 300 printed before 200, because with parallel processing one core can finish before another.

ADD REPLYlink written 28 days ago by Kevin Blighe41k

Correlations for what?

ADD REPLYlink written 29 days ago by ATpoint15k

They are correlations between OTUs.

ADD REPLYlink written 29 days ago by vincentpailler50

You need to provide more information. Which language (R, Python, etc.)?

ADD REPLYlink written 29 days ago by Nicolas Rosewick7.5k

I have added it to the original question.

ADD REPLYlink modified 29 days ago by Nicolas Rosewick7.5k • written 29 days ago by vincentpailler50

The answer you get is at most as good as the question you ask:

awk 'BEGIN{FS="\t";print 1.0"\n"0.9,1.0"\n"0.5,0.8,1.0}'
1
0.9 1
0.5 0.8 1
ADD REPLYlink written 29 days ago by 5heikki8.4k

Sorry for the lack of information. I answered above.

ADD REPLYlink written 29 days ago by vincentpailler50

Could you show us how your data is formatted? Thanks.

ADD REPLYlink written 29 days ago by Nicolas Rosewick7.5k
29 days ago by
Belgium, Brussels
Nicolas Rosewick7.5k wrote:

A simple double for loop. The trick is to use the index of the outer loop as the starting point of the inner loop:

res <- matrix(NA, ncol(data), ncol(data))
for (i in 1:ncol(data)) {
  for (j in i:ncol(data)) {
    res[i, j] <- cor(data[, i], data[, j])
  }
}
ADD COMMENTlink written 29 days ago by Nicolas Rosewick7.5k

Thanks for your answer. But in your code, you need to create a matrix beforehand, right? res <- matrix(NA,ncol(data),ncol(data))

I was asking whether it is possible to compute the upper or the lower triangle directly from my initial matrix, without creating another one first.

ADD REPLYlink written 29 days ago by vincentpailler50

Yes, you create a matrix, but you only compute the upper triangle.

ADD REPLYlink written 29 days ago by Nicolas Rosewick7.5k
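
For what it's worth, once the upper triangle has been filled this way, the lower one can be copied over by symmetry rather than recomputed. A minimal sketch on toy data (the `toy` matrix is an assumption, not the poster's data):

```r
set.seed(1)
toy <- matrix(rnorm(20 * 5), nrow = 20)   # toy data: 20 samples, 5 variables
n <- ncol(toy)

res <- matrix(NA_real_, n, n)
for (i in seq_len(n)) {
  for (j in i:n) {
    res[i, j] <- cor(toy[, i], toy[, j])   # upper triangle and diagonal only
  }
}

# Mirror the upper triangle into the lower one.
res[lower.tri(res)] <- t(res)[lower.tri(res)]

# The result matches the full correlation matrix.
all.equal(res, cor(toy))
```

The loop makes n * (n + 1) / 2 calls to `cor()` instead of n^2, which is where the time saving comes from; the preallocated matrix itself costs very little.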

I know what you mean. But do you also mean that it is impossible to compute the upper or the lower triangle directly on my first matrix? Do I necessarily need to create another matrix with the dimensions of my first matrix (res <- matrix(NA,ncol(data),ncol(data))), then compute the correlations and add them to the matrix I have created, which is a waste of time?

My goal was to compute the upper triangle of my matrix (with the code that Kevin Blighe gave me) without creating another matrix, in order to save time.

ADD REPLYlink written 29 days ago by vincentpailler50

There is no "upper-triangle" data structure in R, to my knowledge. What is the size of your correlation matrix?

ADD REPLYlink written 28 days ago by Nicolas Rosewick7.5k
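
Although base R has no triangular matrix type, a triangle can be packed into a plain vector with a small index helper. The sketch below stores the upper triangle (diagonal included) column by column; `tri_index` is a hypothetical helper name and the data are toy values:

```r
# Position of element (i, j) in the packed vector (hypothetical helper).
# The pair is reordered so that lookups work in either order.
tri_index <- function(i, j) {
  lo <- pmin(i, j)
  hi <- pmax(i, j)
  lo + hi * (hi - 1) / 2
}

set.seed(1)
toy <- matrix(rnorm(20 * 5), nrow = 20)   # toy data: 20 samples, 5 variables
n <- ncol(toy)

packed <- numeric(n * (n + 1) / 2)        # 15 values instead of 25
for (j in seq_len(n)) {
  for (i in seq_len(j)) {
    packed[tri_index(i, j)] <- cor(toy[, i], toy[, j])
  }
}

# Look up the correlation between variables 4 and 2.
packed[tri_index(4, 2)]
```

This roughly halves the storage, but no more than that, so for a very large matrix it does not remove the need to work in blocks or on disk.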

Actually, my correlation matrix is huge (up to 5 TB). My input file has ~800,000 rows. I can't run my code on this input file because it requires too much RAM. I will split the input file into several submatrices, then compute the correlations on these submatrices.

ADD REPLYlink written 28 days ago by vincentpailler50
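
As a rough check of those numbers, assuming one 8-byte double per pair of the ~800,000 features:

```r
n <- 800000              # number of features, taken from the post above
bytes_per_value <- 8     # R stores a double in 8 bytes

# Full n x n matrix, in terabytes (10^12 bytes).
full_tb <- n^2 * bytes_per_value / 1e12

# Strict upper triangle only: n * (n - 1) / 2 values.
half_tb <- n * (n - 1) / 2 * bytes_per_value / 1e12

full_tb
half_tb
```

The full matrix comes to about 5.12 TB and one triangle to about 2.56 TB, so even the half matrix is far beyond ordinary RAM.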

As Nicolas said, there is no data structure in R to do this. However, you could simply output each line to a separate list entry, like this:

[[1]]
x x x x x x x x
[[2]]
x x x x x x x 
[[3]]
x x x x x x
[[4]]
x x x x x
[[5]]
x x x x
[[6]]
x x x
[[7]]
x x
[[8]]
x

You should be able to code for this.

ADD REPLYlink modified 28 days ago • written 28 days ago by Kevin Blighe41k
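
One possible way to produce such a list, sketched on toy data with eight variables to match the layout shown above: entry i holds the correlations of column i with columns i to n, so each entry is one value shorter than the previous one. The names `toy` and `half_list` are made up for illustration.

```r
set.seed(1)
toy <- matrix(rnorm(20 * 8), nrow = 20)   # toy data: 20 samples, 8 variables
n <- ncol(toy)

half_list <- lapply(seq_len(n), function(i) {
  # correlations of column i with columns i, i + 1, ..., n
  as.vector(cor(toy[, i], toy[, i:n, drop = FALSE]))
})

# Entry lengths shrink from n down to 1.
lengths(half_list)
```

With a parallel backend, the same body could be used inside `foreach` with the default list combine, though that is untested here.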

Thanks for your answers. I will stay with my first matrix from which I will extract the correlations I want.

ADD REPLYlink written 28 days ago by vincentpailler50

Okay, sure. À bientôt

ADD REPLYlink written 28 days ago by Kevin Blighe41k

I split my matrix into 10 submatrices with 80,000 rows each. I computed the correlations on these 10 submatrices. Then, I joined these correlations.

Does it look fine?

ADD REPLYlink written 25 days ago by vincentpailler50

How is the correlation going to occur between the submatrices?

ADD REPLYlink written 25 days ago by Kevin Blighe41k

I will use a "double loop", I mean: between the first submatrix and the 9 others, then between the second one and the 8 others, and so on.

Is this a good way to proceed?

ADD REPLYlink written 25 days ago by vincentpailler50

Performing the correlation separately on subsets of the data is one way to do this - oui / yes. I believe this is how bigcor does it. Just make sure that you validate your results... and check it more than 5 times in different ways. Mistakes are always made by everyone.

ADD REPLYlink written 25 days ago by Kevin Blighe41k

Starting from the code you gave me, do you think it is possible to add a few lines to compute the correlations between the submatrices? Or do I need to write a new function?

ADD REPLYlink written 24 days ago by vincentpailler50

I tried something (I checked the bigcor() code) :

> dim(data)
[1]   40 1087

# My last block will contain columns 1001 to 1087

ncol <- ncol(data)
# (bigcor's optional second matrix y is not used here)

rest <- ncol %% 100      # 100 columns per block; 87 columns remain for the last block
large <- ncol - rest
blocks <- ncol %/% 100   # 10 full blocks (+ 1 block with 87 columns)

# assign the columns to the blocks

ngroup <- rep(1:blocks, each = 100)
if (rest > 0) ngroup <- c(ngroup, rep(blocks + 1, rest))
split <- split(1:ncol, ngroup)

# this gives me all the pairwise comparisons between the blocks

combs <- expand.grid(1:length(split), 1:length(split))
combs <- t(apply(combs, 1, sort))
combs <- unique(combs)

      [,1] [,2]
  [1,]    1    1
  [2,]    1    2
  [3,]    1    3
  [4,]    1    4
  [5,]    1    5
  [6,]    1    6
....

But after that, I don't know how to write the nested loop that computes the correlations between these blocks.

ADD REPLYlink modified 24 days ago • written 24 days ago by vincentpailler50
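
One way the loop over block pairs could look, as a minimal sketch rather than bigcor's actual implementation: go through each pair of blocks once, correlate the two sets of columns with `cor(x, y)`, and write the result into both mirrored positions. The toy dimensions (25 columns, blocks of 10) and all variable names are assumptions. For a problem with ~800,000 features, `res` could not be an in-memory matrix; as far as I know, bigcor relies on the ff package to keep the result on disk.

```r
set.seed(1)
toy <- matrix(rnorm(40 * 25), nrow = 40)   # toy data: 40 samples, 25 variables
size <- 10                                  # columns per block

# Assign columns to blocks: 1..10, 11..20, 21..25.
grp <- ceiling(seq_len(ncol(toy)) / size)
blocks <- split(seq_len(ncol(toy)), grp)

# Block pairs (a, b) with a <= b, so that each pair is visited once.
nb <- length(blocks)
pairs <- which(upper.tri(matrix(0, nb, nb), diag = TRUE), arr.ind = TRUE)

res <- matrix(NA_real_, ncol(toy), ncol(toy))
for (k in seq_len(nrow(pairs))) {
  a <- blocks[[pairs[k, 1]]]
  b <- blocks[[pairs[k, 2]]]
  r <- cor(toy[, a, drop = FALSE], toy[, b, drop = FALSE])
  res[a, b] <- r
  res[b, a] <- t(r)   # mirror; drop this line to keep only one triangle
}

# The block-wise result matches the correlation computed in one go.
all.equal(res, cor(toy))
```

Each iteration is independent of the others, so the loop body could in principle be handed to `foreach` with `%dopar%`, provided each worker writes its block somewhere the others do not touch.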