Question: Correlation Matrix - Big Data - R
5 weeks ago, vincentpailler wrote:

Hello everyone,

I would like to create a correlation matrix from a large dataset (about 1,000,000 genes and 40 samples).

I use R, and the matrix is too big to be created (4994.2 Gb). I work on a cluster with a lot of computing power, so I don't think the problem comes from the cluster itself.

Do you have any idea how to generate this matrix?

Best,

Vincent

hadoop correlation matrix R • 171 views

Are you clustering the genes or the samples?

written 5 weeks ago by russhh

I am clustering genes.

written 5 weeks ago by vincentpailler
5 weeks ago, Nicolas Rosewick (Belgium, Brussels) wrote:

Could you post the command you tried? Do you want to compute correlations between samples or between genes? Also, which species has 1,000,000 genes?

For correlation between samples :

# generate test dataset - 40 samples x 1,000,000 genes
m <- matrix(runif(40e6,min = 0,max=100),nrow = 1000000,ncol = 40)
m <- as.data.frame(m)
colnames(m)<-paste0("sample",1:40)
row.names(m)<-paste0("gene",1:1000000)

# compute correlation
cor.res <- cor(m)

For gene-gene correlation you would have to generate a 1,000,000 x 1,000,000 matrix, which would be quite big in memory:

# Example of 1M x 1M matrix in R
m <- matrix(0,ncol=1e6,nrow=1e6)
Error: cannot allocate vector of size 7450.6 Gb
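
As a rough check of where that number comes from, here is the arithmetic as a small sketch. It assumes R's usual 8 bytes per double and that the "Gb" in the error message means binary gigabytes (2^30 bytes); the variable names are only illustrative.

```r
# Memory needed for a dense gene-gene correlation matrix of doubles
n_genes <- 1e6
bytes_per_double <- 8

# Full n x n matrix, in units of 2^30 bytes
full_gib <- n_genes^2 * bytes_per_double / 2^30

# Upper triangle only (no diagonal), since the matrix is symmetric
half_gib <- n_genes * (n_genes - 1) / 2 * bytes_per_double / 2^30

round(full_gib, 1)  # 7450.6, the size reported in the error
round(half_gib, 1)  # a little under half of that
```

Even the upper triangle alone is still several thousand gigabytes, so storing every pairwise value densely in RAM is not realistic at this size.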

Maybe you could try to find a solution by using the bigmemory or ff packages.

In fact, someone has already implemented a solution based on ff.


Actually, I would like to compute correlations between genes, in order to cluster them afterwards. I work on metagenomics data.

I use the "propr" package:

library(propr)

test<-read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rare> tsv", h=T, row.names=1, sep="\t")

test<-t(test)

propr<-propr(test, metric="rho")

Alert: Replacing 0s with next smallest value.
Alert: Saving log-ratio transformed counts to @logratio.
Error: cannot allocate vector of size 4994.2 Gb

modified 5 weeks ago • written 5 weeks ago by vincentpailler
5 weeks ago, Jean-Karim Heriche (EMBL Heidelberg, Germany) wrote:

There are various solutions to this. Here are a few suggestions:

  1. Perform clustering on a random sample of the data, then assign the remaining data points to the closest/most similar cluster.
  2. A related approach consists of removing as much of the data as possible in a preprocessing step. Are all of the million genes really of interest? Maybe a large fraction of them could be put into a group such as "not interesting because they do not vary across samples".
  3. Use an online clustering algorithm (e.g. online k-means).
  4. If you need the whole correlation matrix, first notice that the matrix is symmetric, so you only need one half of it (minus the diagonal). Second, parallelize the computation by splitting it into groups of rows and store the results in a database. You can save space by storing only significant values (i.e. sufficiently different from 0). Clustering from the database can then be done with, for example, agglomerative hierarchical clustering.
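
To make suggestions 2 and 4 a bit more concrete, here is a minimal base-R sketch on toy data. Everything specific in it is an assumption for illustration only: the toy matrix `expr`, the hypothetical helper `block_cor()`, the 25% variance cut-off, the block size and the 0.5 threshold. It collects results in an in-memory data frame where a real run would write each block to a database or to disk.

```r
set.seed(1)
# Toy stand-in for the real data: 2,000 genes x 40 samples (made-up values)
expr <- matrix(rnorm(2000 * 40), nrow = 2000,
               dimnames = list(paste0("gene", 1:2000), paste0("sample", 1:40)))

# Suggestion 2: drop the genes that vary least across samples
gene_var <- apply(expr, 1, var)
expr <- expr[gene_var > quantile(gene_var, 0.25), , drop = FALSE]

# Suggestion 4: correlate blocks of rows, keeping only the upper triangle
# and only pairs with |r| >= threshold, as a sparse edge list
block_cor <- function(x, block_size = 500, threshold = 0.5) {
  n <- nrow(x)
  starts <- seq(1, n, by = block_size)
  pieces <- list()
  for (a in seq_along(starts)) {
    rows_a <- starts[a]:min(starts[a] + block_size - 1, n)
    for (b in a:length(starts)) {          # b >= a: skip the symmetric half
      rows_b <- starts[b]:min(starts[b] + block_size - 1, n)
      # cor() correlates columns, so transpose to put genes in columns
      r <- cor(t(x[rows_a, , drop = FALSE]), t(x[rows_b, , drop = FALSE]))
      idx <- which(abs(r) >= threshold, arr.ind = TRUE)
      gi <- rows_a[idx[, "row"]]
      gj <- rows_b[idx[, "col"]]
      keep <- gi < gj                      # drops diagonal and duplicates
      pieces[[length(pieces) + 1]] <- data.frame(
        gene_i = rownames(x)[gi[keep]],
        gene_j = rownames(x)[gj[keep]],
        r = r[idx[keep, , drop = FALSE]],
        stringsAsFactors = FALSE)
    }
  }
  do.call(rbind, pieces)
}

edges <- block_cor(expr)
head(edges)
```

Each pair of blocks is independent of the others, so the two loops are the natural place to distribute the work across cluster nodes. Only one block-sized correlation matrix has to be in memory at a time.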