Question: Correlation Matrix - Big Data - R
vincentpailler100 wrote:

Hello everyone,

I would like to create a correlation matrix from a large dataset (about 1,000,000 genes and 40 samples).

I use R, and the matrix is too big (4994.2 GB), so it can't be created. I work on a cluster with a lot of computing power, so I don't think the problem comes from the cluster.

Do you have any idea how to generate this matrix?

Best,

Vincent

hadoop correlation matrix R • 974 views
modified 10 months ago by Jean-Karim Heriche21k • written 10 months ago by vincentpailler100

Are you clustering the genes or the samples?

I am clustering genes.

Nicolas Rosewick8.5k wrote:

Could you post the command you tried? Do you want to compute correlations between samples or between genes? Also, which species has 1,000,000 genes?

For correlation between samples:

```r
# generate test dataset - 40 samples x 1,000,000 genes
m <- matrix(runif(40e6, min = 0, max = 100), nrow = 1000000, ncol = 40)
m <- as.data.frame(m)
colnames(m) <- paste0("sample", 1:40)
row.names(m) <- paste0("gene", 1:1000000)

# compute correlation between samples (columns); the result is a 40 x 40 matrix
cor.res <- cor(m)
```

For gene-gene correlation you will have to generate a 1,000,000 x 1,000,000 matrix, which will be quite big in memory:

```r
# Example of a 1M x 1M matrix in R
m <- matrix(0, ncol = 1e6, nrow = 1e6)
# Error: cannot allocate vector of size 7450.6 Gb
```
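The size in that error message follows from simple arithmetic: R stores each numeric value as an 8-byte double, so a dense n x n matrix needs 8 n² bytes. A quick check, where the helper name `dense_matrix_gib` is only for illustration:

```r
# Memory needed for a dense n x n matrix of doubles (8 bytes per value),
# expressed in units of 2^30 bytes, which is what R prints as "Gb".
dense_matrix_gib <- function(n) {
  n^2 * 8 / 2^30
}

dense_matrix_gib(1e6)  # about 7450.6, the size in the error message
dense_matrix_gib(1e5)  # about 74.5, still large for "only" 100,000 genes
```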

Maybe you could try to find a solution by using the bigmemory or ff packages.

In fact someone already implemented a solution based on ff.

Actually, I would like to compute correlations between genes, so that I can then cluster them. I work on metagenomics data.

I use the "propr" package:

library(propr)

test<-t(test)

propr<-propr(test, metric="rho")

Alert: Replacing 0s with next smallest value.

Alert: Saving log-ratio transformed counts to @logratio.

Error: cannot allocate vector of size 4994.2 Gb

Jean-Karim Heriche21k wrote:

There are various solutions to this. Here are a few suggestions:

1. Perform clustering on a random sample of the data, then assign the remaining data points to the closest/most similar cluster.
2. A related approach is to remove as much of the data as possible in a preprocessing step: are all of the million genes really of interest? A large fraction of them could perhaps be put into a group such as "not interesting because they do not vary across samples".
3. Use an online clustering algorithm (e.g. online k-means).
4. If you need the whole correlation matrix, first note that the matrix is symmetric, so you only need one half of it (minus the diagonal). Second, parallelize the computation by splitting it into groups of rows and store the results in a database. You can save space by storing only significant values (i.e. sufficiently different from 0). Clustering from the database can then be done with, for example, agglomerative hierarchical clustering.
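As a rough illustration of how suggestions 2 and 4 might fit together, here is a minimal base R sketch. It is not a tested pipeline: the function name `blockwise_cor_edges`, the block size, the 0.8 cutoff and the toy data (with two planted correlated pairs and one constant gene) are all assumptions made for the example, and the results are collected in a data frame where a real run would write them to a database.

```r
# Hypothetical helper: correlate genes (rows of `x`) block by block and keep
# only pairs with |r| >= min_abs_r. Only the upper triangle (i < j) is kept,
# so the full n x n matrix is never held in memory.
blockwise_cor_edges <- function(x, block_size = 1000, min_abs_r = 0.8) {
  n <- nrow(x)
  starts <- seq(1, n, by = block_size)
  edges <- list()
  for (a in seq_along(starts)) {
    rows_a <- starts[a]:min(starts[a] + block_size - 1, n)
    for (b in a:length(starts)) {   # b >= a: skip the mirrored blocks
      rows_b <- starts[b]:min(starts[b] + block_size - 1, n)
      # cor() correlates columns, so transpose to put genes in columns
      r <- cor(t(x[rows_a, , drop = FALSE]), t(x[rows_b, , drop = FALSE]))
      keep <- which(abs(r) >= min_abs_r, arr.ind = TRUE)
      i <- rows_a[keep[, 1]]
      j <- rows_b[keep[, 2]]
      upper <- i < j                # drops the diagonal and duplicate pairs
      edges[[length(edges) + 1]] <- data.frame(
        gene_i = i[upper], gene_j = j[upper], r = r[keep][upper]
      )
    }
  }
  do.call(rbind, edges)
}

# Toy data (assumption): 200 genes x 40 samples of random noise
set.seed(1)
x <- matrix(rnorm(200 * 40), nrow = 200, ncol = 40)
x[2, ] <- x[1, ] * 2 + 1    # gene 2 perfectly correlated with gene 1
x[150, ] <- -x[10, ]        # gene 150 perfectly anti-correlated with gene 10
x[5, ] <- 3                 # constant gene, uninformative

# Suggestion 2: drop genes that do not vary across samples
keep_genes <- which(apply(x, 1, var) > 0)

# Suggestion 4: block-wise upper triangle, storing only strong correlations
edges <- blockwise_cor_edges(x[keep_genes, ], block_size = 64)

# Map row positions in the filtered matrix back to the original gene indices
edges$gene_i <- keep_genes[edges$gene_i]
edges$gene_j <- keep_genes[edges$gene_j]
print(edges)
```

Each pair of blocks is computed independently of the others, so on a cluster the inner step could be dispatched to separate workers (for example with `parallel::mclapply`) and each block's rows appended to a database table; those parts are left out of the sketch.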