Question: How to get a half correlation matrix?
vincentpailler110 wrote:

Hi guys,

I would like to know whether it is possible to compute only half of a correlation matrix. I mean, I don't want to compute the whole matrix and then extract just the upper or the lower half; I want to get the half matrix directly.

I use R. I already have some code, but it gives the whole matrix:

``````data<-read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rares.tsv", sep="\t", h=T, row.names=1)+1

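# Assumed setup, not shown in the original post: %dopar% needs foreach and a registered backend
library(foreach)
library(doParallel)
registerDoParallel()

res <- foreach(i = seq_len(ncol(data)),
               .combine = rbind,
               .multicombine = TRUE,
               .inorder = FALSE,
               .packages = c('data.table', 'doParallel')) %dopar% {
  apply(data, 2, function(x) 1 - ((var(data[,i] - x)) / (var(data[,i]) + var(x))))
}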
``````

Thanks

correlation matrix R • 714 views
modified 10 months ago by zx87548.8k • written 10 months ago by vincentpailler110

For the record, this code is mine, from my previous answer:

I read some documentation about the functions upper.tri and lower.tri, but I don't know whether they would be suitable for this code?

Yes, I know what you are aiming to do. A data-frame/matrix is 'rectangular' in structure, though. If you just fill the upper part, you still have to have the bottom part, even if the cells are empty.
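
To illustrate the point about the rectangular structure, here is a minimal sketch on a made-up toy matrix (the names `toy`, `full`, `half`, and `upper_values` are only for illustration). It computes the full matrix with `cor()`, so it does not save any computation; it only shows that blanking one triangle leaves the dimensions unchanged, and that `upper.tri()` can pull the upper half out as a plain vector.

```r
# Toy data (assumption): 5 observations of 4 variables
set.seed(42)
toy <- matrix(rnorm(20), nrow = 5)

full <- cor(toy)                  # full 4 x 4 symmetric correlation matrix

half <- full
half[lower.tri(half)] <- NA       # blank the lower triangle; the cells still exist

dim(half)                         # the matrix is still 4 x 4

# The upper triangle (including the diagonal) as a vector of 4 * 5 / 2 = 10 values
upper_values <- full[upper.tri(full, diag = TRUE)]
length(upper_values)
```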

My goal here is to save time. So, if I fill the upper or the lower part with 0 and compute the correlations only on the other part, will I save time?

You may just have to be patient. You could add a line such as this to your foreach loop:

``````if ((i %% 100) == 0) {
  print(i)
}
``````

It will then print the value of `i` at every 100th iteration. Because of how the parallel processing works, you may see, for example, 300 printed before 200 if one core finishes before another.

Correlations for what?

They are correlations between OTUs.

Please put that in the original question.

The answer you get is at most as good as the question you ask:

``````awk 'BEGIN{FS="\t";print 1.0"\n"0.9,1.0"\n"0.5,0.8,1.0}'
1
0.9 1
0.5 0.8 1
``````

Sorry for the lack of information. I answered above.

Could you show us how your data is formatted? Thanks.

Nicolas Rosewick8.6k wrote:

A simple double for loop will do. The trick is to start the second loop at the current index of the first loop:

`````` res <- matrix(NA,ncol(data),ncol(data))
for(i in 1:ncol(data)){
  # start j at i so that only the upper triangle (and the diagonal) is computed
  for(j in i:ncol(data)){
    res[i,j]<-cor(data[,i],data[,j])
  }
}
``````

Thanks for your answer. But in your code, you need to create a matrix first, right? `res <- matrix(NA,ncol(data),ncol(data))`

I was asking whether it is possible to compute the upper or the lower triangle directly on my initial matrix, without creating another one first?

Yes, you create a matrix, but you only compute the upper triangle.

I know what you mean. But do you also mean that it is impossible to compute the upper or the lower triangle directly on my first matrix? Do I necessarily need to create another matrix with the dimensions of my first matrix (`res <- matrix(NA,ncol(data),ncol(data))`), then compute the correlations and store them in the matrix that I have created, which is a waste of time?

I ask because my goal was to compute the upper triangle of my matrix (using the code that Kevin Blighe gave me) without creating another matrix, in order to save time.

There is no "upper-triangle" data structure in R, to my knowledge. What is the size of your correlation matrix?

Actually, my correlation matrix is huge (up to 5 TB). My input file has ~800,000 rows. I can't run my code on this input file because it requires too much RAM. I will split this input file into several submatrices, then I will compute the correlations on these submatrices.
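
For a sense of scale, here is a back-of-the-envelope estimate. It assumes that all ~800,000 rows are correlated against each other and that each value is stored as an 8-byte double (R's default numeric type); the variable names are only for illustration.

```r
n <- 8e5                  # ~800,000 variables, as stated above
bytes_per_value <- 8      # R stores numeric values as 8-byte doubles

# Full n x n matrix, in terabytes (1 TB = 1e12 bytes)
full_tb <- n^2 * bytes_per_value / 1e12

# Upper triangle including the diagonal: n * (n + 1) / 2 values
half_tb <- n * (n + 1) / 2 * bytes_per_value / 1e12

c(full = full_tb, half = half_tb)
```

Under these assumptions the full matrix comes to about 5.12 TB, which matches the figure above, and storing only one triangle would still need roughly half of that.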

As Nicolas said, there is no data structure in R to do this. However, you could simply output each line to a separate list entry, like this:

``````[]
x x x x x x x x
[]
x x x x x x x
[]
x x x x x x
[]
x x x x x
[]
x x x x
[]
x x x
[]
x x
[]
x
``````

You should be able to code for this.
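
A minimal sketch of that list layout, assuming plain Pearson correlation via `cor()` and a made-up toy matrix (`toy`, `p`, and `half_list` are illustrative names, not from the thread). Entry `i` holds the correlations of column `i` with columns `i` to `p`, so each entry is one value shorter than the previous one.

```r
# Toy data (assumption): 6 observations of 5 variables
set.seed(7)
toy <- matrix(rnorm(30), nrow = 6)
p <- ncol(toy)

half_list <- lapply(seq_len(p), function(i) {
  # correlations of column i with columns i..p; as.vector() drops the 1 x k matrix shape
  as.vector(cor(toy[, i], toy[, i:p]))
})

lengths(half_list)   # one entry per row of the triangle, shrinking by one each time
```

Note that this still stores every value of the triangle; it only avoids allocating the unused half.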

Thanks for your answers. I will stay with my first matrix from which I will extract the correlations I want.

Okay, sure. À bientôt

I split my matrix into 10 submatrices of 80,000 rows each. I computed the correlations on these 10 submatrices. Then, I joined these correlations.

Does it look fine?

How is the correlation going to occur between the submatrices?

I will use a "double loop", I mean:

- between the first submatrix and the 9 others
- between the second one and the 8 others...

Is this the right way to proceed?

Performing the correlation separately on subsets of the data is one way to do this - oui / yes. I believe this is how bigcor does it. Just make sure that you validate your results... and check it more than 5 times in different ways. Mistakes are always made by everyone.

Starting from the code you gave me, do you think it is possible to add a few lines to compute the correlations between the submatrices? Or do I need to write a new function?

I tried something (after checking the bigcor() code):

``````> dim(data)
[1]   40 1087

# My last block will contain columns 1001 to 1087

ncol <- ncol(data)
y <- NULL   # no second matrix in this case (y is an optional argument of bigcor())
if (!is.null(y)) ycol <- ncol(y)

rest <- ncol %% 100      # 100 columns per block; 87 columns remain for the last block
large <- ncol - rest
blocks <- ncol %/% 100   # 10 full blocks (+ 1 block with 87 columns)

# Assign the columns to the blocks

ngroup <- rep(1:blocks, each = 100)
if (rest > 0) ngroup <- c(ngroup, rep(blocks + 1, rest))
split <- split(1:ncol, ngroup)

# This gives all the pairwise comparisons between the blocks

combs <- expand.grid(1:length(split), 1:length(split))
combs <- t(apply(combs, 1, sort))
combs <- unique(combs)
if (!is.null(y)) combs <- cbind(1:length(split), rep(1, length(split)))

     [,1] [,2]
[1,]    1    1
[2,]    1    2
[3,]    1    3
[4,]    1    4
[5,]    1    5
[6,]    1    6
....
``````

But after that, I don't know how to create the nested loop to compute the correlations between these blocks.
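
One way the loop over block pairs could be sketched, under some loud assumptions: it uses plain Pearson correlation via `cor()` rather than the custom measure from the question, a made-up toy matrix instead of the real data, and a result matrix small enough to fit in memory. The names `toy`, `block_size`, and `col_blocks` are illustrative and not taken from bigcor(). Because the block pairs are unordered, a single loop over them does the job of the nested loop.

```r
# Toy data (assumption): 40 rows, 250 columns, blocks of 100 columns
set.seed(1)
toy <- matrix(rnorm(40 * 250), nrow = 40)
block_size <- 100

n_col <- ncol(toy)
group <- ceiling(seq_len(n_col) / block_size)    # block id of each column
col_blocks <- split(seq_len(n_col), group)       # column indices per block

# All unordered block pairs (a <= b), including each block with itself
combs <- expand.grid(a = seq_along(col_blocks), b = seq_along(col_blocks))
combs <- combs[combs$a <= combs$b, ]

res <- matrix(NA_real_, n_col, n_col)
for (k in seq_len(nrow(combs))) {
  ia <- col_blocks[[combs$a[k]]]
  ib <- col_blocks[[combs$b[k]]]
  # correlations between the columns of block a and the columns of block b
  res[ia, ib] <- cor(toy[, ia], toy[, ib])
}

# Validation on the toy data: the upper triangle should match a direct cor() call
ut <- upper.tri(res, diag = TRUE)
all.equal(res[ut], cor(toy)[ut])
```

The blocks below the diagonal are left as `NA`, so only the upper block triangle is computed. For the real data, each `cor()` result would probably have to be written to disk instead of into `res`, since the full matrix does not fit in RAM.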