Question: Correlation Matrix - Big Data - R
10 months ago by
vincentpailler100 wrote:

Hello everyone,

I would like to create a correlation matrix from big data (about 1,000,000 genes and 40 samples).

I use R, and the matrix is too big to be created (4994.2 GB). I work on a cluster with a lot of computing power, so I don't think the problem comes from the cluster.

Do you have any idea how to generate this matrix?



hadoop correlation matrix R • 974 views

Are you clustering the genes or the samples?

Reply written 10 months ago by russhh (4.9k)

I am clustering genes

Reply written 10 months ago by vincentpailler (100)
10 months ago by
Belgium, Brussels
Nicolas Rosewick8.5k wrote:

Could you post the command you tried? Do you want to compute the correlation between samples or between genes? Also, which species has 1,000,000 genes?

For correlation between samples :

# generate a test dataset: 1,000,000 genes (rows) x 40 samples (columns)
m <- matrix(runif(40e6, min = 0, max = 100), nrow = 1000000, ncol = 40)

# compute the correlation between samples (columns): a 40 x 40 matrix
cor.res <- cor(m)

For gene-gene correlation you will have to generate a 1,000,000 x 1,000,000 matrix, which will be quite big in memory:

# Example of 1M x 1M matrix in R
m <- matrix(0,ncol=1e6,nrow=1e6)
Error: cannot allocate vector of size 7450.6 Gb
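To make the size in that error message concrete, here is a minimal sketch of the arithmetic, assuming a dense matrix of 8-byte doubles. The helper names `dense_matrix_gib` and `upper_triangle_gib` are made up for illustration and are not part of any package.

```r
# Memory needed for a dense n x n matrix of doubles (8 bytes each), in GiB.
# Hypothetical helper, for illustration only.
dense_matrix_gib <- function(n, bytes_per_value = 8) {
  n^2 * bytes_per_value / 2^30
}

# Memory needed for the upper triangle only (diagonal excluded), in GiB.
# Hypothetical helper, for illustration only.
upper_triangle_gib <- function(n, bytes_per_value = 8) {
  n * (n - 1) / 2 * bytes_per_value / 2^30
}

dense_matrix_gib(1e6)    # about 7450.6, the size reported in the error above
upper_triangle_gib(1e6)  # roughly half of that, still far beyond typical RAM
```

So even storing only one half of the matrix does not make it fit in memory; the values have to be computed in pieces, kept on disk, or thresholded.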

Maybe you could try to find a solution by using the bigmemory or ff packages.

In fact someone already implemented a solution based on ff.


Actually, I would like to compute the correlation between genes, in order to cluster them afterwards. I work on metagenomic data.

I use the "propr" package:


test<-read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rare> tsv", h=T, row.names=1, sep="\t")


propr<-propr(test, metric="rho")

Alert: Replacing 0s with next smallest value.
Alert: Saving log-ratio transformed counts to @logratio.
Error: cannot allocate vector of size 4994.2 Gb

Reply modified 10 months ago • written 10 months ago by vincentpailler (100)
10 months ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche21k wrote:

There are various solutions to this. Here are a few suggestions:

  1. Perform clustering on a random sample of the data, then assign the remaining data points to the closest/most similar cluster.
  2. A related approach consists of removing as much of the data as possible in a preprocessing step. Are all of the million genes really of interest? Maybe a large fraction of them could be put into a group such as "not interesting because they do not vary across samples".
  3. Use an online clustering algorithm (e.g. online k-means).
  4. If you need the whole correlation matrix, first notice that the matrix is symmetric, so you only need one half of it (minus the diagonal). Second, parallelize the computation by splitting it into groups of rows and store the results in a database. You can save space by storing only significant values (i.e. sufficiently different from 0). Clustering from the database can then be done with, for example, agglomerative hierarchical clustering.
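As a rough illustration of suggestion 4, here is a minimal base R sketch, not a definitive implementation. It assumes plain Pearson correlation (not the rho metric from propr) and a genes x samples numeric matrix. The function name `block_cor_edges` and the parameters `block_size` and `threshold` are hypothetical, and a data frame stands in for the database. Only blocks on or above the diagonal are visited, and only pairs with |r| at or above the threshold are kept.

```r
# Sketch: gene-gene Pearson correlation computed block by block.
# block_cor_edges, block_size and threshold are hypothetical names.
block_cor_edges <- function(m, block_size = 1000, threshold = 0.8) {
  n_genes <- nrow(m)
  n_samples <- ncol(m)
  # Standardize each gene (row) so that correlation is a plain matrix product.
  z <- t(scale(t(m))) / sqrt(n_samples - 1)
  starts <- seq(1, n_genes, by = block_size)
  edges <- list()
  for (a in seq_along(starts)) {
    rows_a <- starts[a]:min(starts[a] + block_size - 1, n_genes)
    # Symmetry: only visit blocks on or above the diagonal.
    for (b in a:length(starts)) {
      rows_b <- starts[b]:min(starts[b] + block_size - 1, n_genes)
      r <- tcrossprod(z[rows_a, , drop = FALSE], z[rows_b, , drop = FALSE])
      hit <- which(abs(r) >= threshold, arr.ind = TRUE)
      gene_i <- rows_a[hit[, "row"]]
      gene_j <- rows_b[hit[, "col"]]
      keep <- gene_i < gene_j  # drop the diagonal and duplicated pairs
      edges[[length(edges) + 1]] <- data.frame(
        gene_i = gene_i[keep], gene_j = gene_j[keep], r = r[hit][keep]
      )
    }
  }
  # One row per retained gene pair; this is what would go into the database.
  do.call(rbind, edges)
}

# Small simulated example: 200 genes x 40 samples, one planted correlated pair.
set.seed(1)
m <- matrix(rnorm(200 * 40), nrow = 200, ncol = 40)
m[2, ] <- m[1, ] + rnorm(40, sd = 0.01)
edges <- block_cor_edges(m, block_size = 64, threshold = 0.8)
```

Each (a, b) block is independent of the others, so the two loops are the natural place to parallelize, and each block's result can be written out as soon as it is computed instead of being accumulated in memory.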