Question: R: Error In Pvclust Function While Clustering
Diana (480) wrote:

Hi all,

I'm trying to cluster RNA-seq data using the pvclust function from the pvclust package, but it gives me this error: cannot allocate vector of length 1623767616. Is this because I have 40296 genes and that is too much data?

My code is this:

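library(pvclust)

test2 <- read.csv("RNAseq_to_cluster.csv", sep=",")
test3 <- test2[, 2:4]              # columns contain samples
row.names(test3) <- test2$gene     # use gene names as row names
mat <- data.matrix(test3)          # renamed from 'matrix' to avoid masking base::matrix()
transposed <- t(mat)               # genes become columns, because pvclust clusters columns
pv <- pvclust(transposed, method.dist="correlation", method.hclust="average", nboot=1000)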

Error in cor(x, method = "pearson", use = use.cor) : 
  cannot allocate vector of length 1623767616

EDIT: first few lines of the input file:

gene    sample1    sample2    sample3
Mar-01    4.19504    3.9006    4.15683
Mar-02    3.0554    3.4261    3.76675
Sep-02    77.1536    65.1284    76.4927
Mar-03    1.01555    1.28626    0.461987

Please help.

Thanks!

modified 14 months ago by Damian Kao (10.0k) • written 14 months ago by Diana (480)

Yeah there isn't enough memory to make a vector of that size. But I don't see why it would need to make a vector of that size for what you are doing. Can you post the first few lines of the csv input file?

modified 14 months ago • written 14 months ago by Damian Kao (10.0k)

I've posted a few lines of the input file.

written 14 months ago by Diana (480)

Try repeating with fewer genes to see whether it runs. I assume you have reached the R memory limit of 4 GB. Check this post and this post for possible workarounds.

written 14 months ago by Sukhdeep Singh (4.6k)
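One way to act on the "fewer genes" suggestion is to keep only the most variable genes before clustering. The following is a minimal sketch in base R, not something posted in the thread: the data are simulated, and the names `expr`, `keep_n` and `expr_small` as well as the cutoff of 200 genes are assumptions chosen purely for illustration.

```r
set.seed(1)

# Simulated stand-in for the real table: 1000 genes x 3 samples (hypothetical data)
expr <- matrix(rexp(3000, rate = 0.1), nrow = 1000,
               dimnames = list(paste0("gene", 1:1000), paste0("sample", 1:3)))

# Variance of each gene across the samples
gene_var <- apply(expr, 1, var)

# Keep the keep_n most variable genes (the cutoff is arbitrary)
keep_n <- 200
top_genes <- names(sort(gene_var, decreasing = TRUE))[seq_len(keep_n)]
expr_small <- expr[top_genes, , drop = FALSE]

# The gene-by-gene correlation matrix is now keep_n x keep_n
dim(cor(t(expr_small)))
```

The transposed subset, `t(expr_small)`, has genes as columns, which is the orientation used in the question. Whether a variance filter is appropriate for a given dataset is a separate judgment call.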

Statistically, it's not a great idea to blow up a 40k × 3 dataset into a 40k × 40k correlation matrix.

written 14 months ago by Ben (1.8k)
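The number in the error message matches the size of that gene-by-gene correlation matrix exactly. A short arithmetic check, assuming the usual 8 bytes per double-precision value in R (the variable names are illustrative only):

```r
n_genes <- 40296            # number of genes reported in the question
n_cells <- n_genes^2        # entries in a gene-by-gene correlation matrix
bytes   <- n_cells * 8      # each double takes 8 bytes
gib     <- bytes / 2^30     # convert bytes to GiB

format(n_cells, big.mark = ",", scientific = FALSE)   # "1,623,767,616"
round(gib, 1)                                         # 12.1
```

So a single copy of the matrix needs roughly 12 GiB before any bootstrap replicates are computed, which is consistent with the allocation failing.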
Damian Kao (10.0k) wrote:

I don't think you need to do much to your data input to run the pvclust function. The transposition of the data matrix might be the problem. Instead of finding pair-wise correlation for just 3 sets of data (sample1,2,3), the transposition might be telling pvclust to do it for 40,000 sets of data (genes).

Try just this:

data <- as.matrix(read.csv('RNAseq_to_cluster.csv', sep=',', header=TRUE, row.names=1))
pv <- pvclust(data, method.dist="correlation", method.hclust="average", nboot=1000)

modified 14 months ago • written 14 months ago by Damian Kao (10.0k)

pvclust clusters columns; that's why I was using the transpose function. Otherwise it just clusters the samples, whereas I want to cluster the genes according to their expression profiles in the 3 samples.

written 14 months ago by Diana (480)

I see. I skimmed through the pvclust description and thought you just wanted to cluster by sample. Perhaps the package just wasn't designed to cluster that many columns? Are you specifically interested in the p-values pvclust generates? If not, there are plenty of generic hierarchical clustering scripts out there that will handle a large number of genes and run faster. Clustering using Python's scipy is pretty fast. You might want to look at this also: http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm

modified 14 months ago • written 14 months ago by Damian Kao (10.0k)
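If the bootstrap p-values are not needed, plain hierarchical clustering is also available in base R. The sketch below is an illustration on simulated data rather than code from the thread: it uses one minus the Pearson correlation as the distance with average linkage, echoing the settings in the question, and the names `expr`, `gene_dist` and the choice of 4 groups are assumptions.

```r
set.seed(42)

# Simulated stand-in: 50 genes x 3 samples (hypothetical data)
expr <- matrix(rnorm(50 * 3), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("sample", 1:3)))

# Pearson correlation between gene profiles (genes must be columns for cor)
gene_cor <- cor(t(expr))

# Turn correlation into a distance: 0 for perfectly correlated profiles
gene_dist <- as.dist(1 - gene_cor)

# Average-linkage hierarchical clustering, without bootstrap p-values
hc <- hclust(gene_dist, method = "average")

# Cut the tree into an arbitrary number of groups
groups <- cutree(hc, k = 4)
table(groups)
```

This still computes the full gene-by-gene correlation matrix, so on its own it is only practical for a subset of the 40296 genes.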