Question

Correlation Matrix - Big Data - R

5.2 years ago

pablo ▴ 300

Hello everyone,

I would like to create a correlation matrix from a large dataset (about 1,000,000 genes and 40 samples).

I use R, but the matrix is too big (4994.2 GB) to be created. I work on a cluster with a lot of computing power, so I don't think the cluster itself is the problem.

Do you have any idea how to generate this matrix?

Best,

Vincent

R Hadoop correlation matrix • 4.3k views

Updated 5.2 years ago by Jean-Karim Heriche 27k • written 5.2 years ago by pablo ▴ 300

Are you clustering the genes or the samples?

Reply written 5.2 years ago by russhh 5.7k

I am clustering genes.

Reply written 5.2 years ago by pablo ▴ 300

zx8754 · Answer 1 · 2019-02-14

5.2 years ago

Nicolas Rosewick 11k

Could you post the command you tried? Do you want to compute the correlation between samples or between genes? Also, which species has 1,000,000 genes?

For correlation between samples:

# generate test dataset - 40 samples x 1,000,000 genes
m <- matrix(runif(40e6,min = 0,max=100),nrow = 1000000,ncol = 40)
m <- as.data.frame(m)
colnames(m)<-paste0("sample",1:40)
row.names(m)<-paste0("gene",1:1000000)

# compute correlation
cor.res <- cor(m)

For gene-gene correlation you would have to generate a 1,000,000 x 1,000,000 matrix, which will be quite big in memory:

# Example of 1M x 1M matrix in R
m <- matrix(0, ncol = 1e6, nrow = 1e6)
#> Error: cannot allocate vector of size 7450.6 Gb
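
The size in that error message follows directly from the matrix dimensions. As a rough back-of-the-envelope check (assuming 8 bytes per double and that R reports binary gigabytes, i.e. 2^30 bytes, as "Gb"):

```r
# Memory needed for a dense n x n matrix of doubles (8 bytes each)
n <- 1e6
bytes <- n^2 * 8

full_gb <- bytes / 2^30                   # dense 1M x 1M matrix
half_gb <- (n * (n - 1) / 2) * 8 / 2^30   # upper triangle only, no diagonal

round(full_gb, 1)   # 7450.6, matching the error message above
round(half_gb, 1)   # just under half of the full size
```

Even storing only one triangle of the symmetric matrix would still need several terabytes, so a dense in-memory matrix is not realistic at this scale.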

Maybe you could try to find a solution by using the bigmemory or ff packages.

In fact someone already implemented a solution based on ff.

Updated 5.2 years ago by zx8754 11k • written 5.2 years ago by Nicolas Rosewick 11k

Actually, I would like to compute the correlation between genes and then cluster them. I work on metagenomics data.

I use the "propr" package:

library(propr)

test<-read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rare> tsv", h=T, row.names=1, sep="\t")

test<-t(test)

propr<-propr(test, metric="rho")

Alert: Replacing 0s with next smallest value.
Alert: Saving log-ratio transformed counts to @logratio.
Error: cannot allocate vector of size 4994.2 Gb

Reply written 5.2 years ago by pablo ▴ 300

zx8754 · Answer 2 · 2019-02-14

There are various solutions to this. Here are a few suggestions:

Perform clustering on a random sample of the data then assign data points to the closest/most similar cluster.
Another, related approach is to remove as much of the data as possible in a preprocessing step: are all of the million genes really of interest? Maybe a large fraction of them could be put into a group such as "not interesting because they do not vary across samples".
Use an online clustering algorithm (e.g. online k-means).
If you need the whole correlation matrix, first note that it is symmetric, so you only need one half of it (minus the diagonal). Second, parallelize the computation by splitting it into groups of rows and store the results in a database. You can save space by storing only significant values (i.e. sufficiently different from 0). Clustering from the database can then be done with, for example, agglomerative hierarchical clustering.
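
The filtering and block-wise suggestions can be sketched in base R roughly as follows. This is only an illustration under several assumptions: it uses plain Pearson correlation (not propr's rho), toy dimensions, and made-up values for the variance filter, the block size and the cutoff `min_abs_r`. A data frame stands in for the database table.

```r
# Block-wise gene-gene Pearson correlation that keeps only strong pairs.
# Toy dimensions so the sketch runs quickly; all names are illustrative.
set.seed(1)
n_genes <- 2000
n_samples <- 40
x <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
            dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# 1. Drop genes that barely vary across samples (threshold is an assumption)
gene_sd <- apply(x, 1, sd)
x <- x[gene_sd > 1e-8, , drop = FALSE]

# 2. Standardise each gene so that correlation becomes a scaled cross-product
z <- t(scale(t(x)))          # each row now has mean 0 and sd 1
n <- ncol(z)

# 3. Walk over the upper triangle of the correlation matrix block by block
block_size <- 500            # tune to the available memory
starts <- seq(1, nrow(z), by = block_size)
min_abs_r <- 0.5             # assumed cutoff for "sufficiently different from 0"
hits <- list()

for (i in starts) {
  rows_i <- i:min(i + block_size - 1, nrow(z))
  for (j in starts[starts >= i]) {
    rows_j <- j:min(j + block_size - 1, nrow(z))
    # Correlations between the genes of block i and the genes of block j
    r <- tcrossprod(z[rows_i, , drop = FALSE],
                    z[rows_j, , drop = FALSE]) / (n - 1)
    keep <- which(abs(r) >= min_abs_r, arr.ind = TRUE)
    gi <- rows_i[keep[, 1]]
    gj <- rows_j[keep[, 2]]
    upper <- gi < gj         # drop the diagonal and mirrored pairs
    hits[[length(hits) + 1]] <- data.frame(
      gene1 = rownames(z)[gi[upper]],
      gene2 = rownames(z)[gj[upper]],
      r     = r[keep][upper]
    )
  }
}

# One row per retained gene pair
edges <- do.call(rbind, hits)
```

Each (i, j) block is independent of the others, so the inner computation could be handed to separate workers, and each block's rows could be appended to a database table instead of being accumulated in `hits`.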