Question: vector memory exhausted running dist() on a single ADT dataset
cook.675 wrote, 10 weeks ago:

This is a cross-post from the Satija lab GitHub forum; I thought I might get more eyes here, so I'm also posting it.

Session info:
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

I'm running a MacBook with 8 GB of RAM. I am following this vignette for CITE-seq, but I am only loading and working with the ADT data, starting the vignette from the "Cluster directly on protein levels" section.

Everything is fine until I get to the following command: adt.dist <- dist(t(adt.data)), which returns: Error: vector memory exhausted (limit reached?)

I have tried setting my R_MAX_VSIZE variable to values anywhere from 8Gb to 700Gb, as suggested on Stack Overflow while troubleshooting this. I also check that the value is set correctly when I start R, using Sys.getenv("R_MAX_VSIZE").
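For reference, I set it by adding a line to ~/.Renviron and restarting R (the value below is just an example, not a recommendation):

```
R_MAX_VSIZE=100Gb
```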

In order to maximize my chances of success, just prior to executing the troublesome line of code I cleared all unused objects from the workspace and ran garbage collection.

After doing this, mem_used() returns a value of 387 MB. adt.data is the only object in the workspace prior to running dist(), and object.size(adt.data) returns a value of 212 MB.

I can't think of anything else to try. It doesn't feel like my machine should be incapable of running this; the data doesn't seem that big. Is there another solution to this problem? Please let me know if you'd like any additional information. Thanks so much!

Edit: I just tried running it on a friend's machine and the same error came up, only it said:

Error: cannot allocate vector of size 2025.0 Gb

Well, I guess that's the problem... I don't have 2 TB. Is there a way to shrink this, or to run an alternate type of PCA, in order to do the clustering with the ADT data alone?

Edit 2: I just tried changing R_MAX_VSIZE to 2200Gb and rerunning. The program accepted it; I let it run for a while, came back an hour later, and got the following message:

R session Aborted. R encountered a fatal error. The session was terminated
Tags: rna-seq, seurat
Jean-Karim Heriche (EMBL Heidelberg, Germany) wrote, 10 weeks ago:

This error means that R can't find enough contiguous memory. You may have enough RAM in total, but if it's fragmented you may only have access to small contiguous chunks. To mitigate the issue from within a program, one should try to allocate objects in order of decreasing size (i.e. big matrices first) and lifetime, so that smaller objects can fit in the footprint left when larger ones are destroyed, though this may not always be practical when it's not your own code. I've occasionally had success running gc(), but it may have worked just by chance. Sometimes the solution may be to reboot, if other (often long-running) processes are holding on to memory.


Yeah, I've tried everything mentioned: rebooting, closing every application, garbage collection, clearing all unused and unnecessary variables, increasing the maximum allowable memory...

I tried it on a desktop we have with 32 GB of RAM on Windows 10 and got the same error; I can't overcome it no matter what. I'm not sure what the next step is. I guess submitting it to our campus computing cluster, but I was really trying to avoid that.

The line right before the error is adt.data <- GetAssayData(Adt, slot = "data") and I tried adding as.sparse right before the function call, but that didn't help at all.

— cook.675

What is the dimension of the distance matrix you're trying to compute? Is it possible that you're not computing on the expected data (e.g. computing dist between rows vs between columns or wrong data in the data frame)? Also check that when starting from the middle of the vignette you haven't missed any data preprocessing steps.
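As a quick sanity check (a sketch using a small stand-in matrix, since I don't have your object): dist() computes distances between the rows of its input, so the row count of whatever you pass in determines how big the result gets.

```r
# Toy stand-in for adt.data: 25 proteins (rows) x 1000 cells (columns).
# The real object reportedly has 737280 columns, far too many for dist().
x <- matrix(rnorm(25 * 1000), nrow = 25, ncol = 1000)

# dist() works on ROWS: a dist object holds n*(n-1)/2 pairwise distances
d_proteins <- dist(x)     # 25 rows  -> 25*24/2   = 300 distances (tiny)
d_cells    <- dist(t(x))  # 1000 rows -> 1000*999/2 = 499500 distances

length(d_proteins)  # 300
length(d_cells)     # 499500
```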

— Jean-Karim Heriche

What is the dimension of the distance matrix you're trying to compute?

I'm not sure exactly; I'm still trying to understand the dist() function well enough to figure this out. The dimensions of the matrix passed to dist() are 25 x 737280. What does the t() do here, in dist(t(x))?

is it possible that you're not computing on the expected data

Yes, I will look into this some more presently...

Also check that when starting from the middle of the vignette you haven't missed any data preprocessing steps.

Double- and triple-checked; it should be alright.

— cook.675

t() takes the transpose, so t(x) is in your case a 737280 x 25 matrix. dist(x) computes the distances between the rows of x, so dist(t(x)) in your case computes a 737280 x 737280 distance matrix, which would take ~4 TB of RAM if stored as a dense matrix, or ~2 TB as a dist object (which stores only the lower triangle).
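The arithmetic matches the error message exactly (a sketch; a dist object stores the lower triangle as doubles, 8 bytes each):

```r
n     <- 737280              # number of rows of t(x), i.e. cells
pairs <- n * (n - 1) / 2     # entries in a dist object (lower triangle)
bytes <- pairs * 8           # each distance is a double: 8 bytes

bytes / 1024^3               # ~2025 GiB, the "2025.0 Gb" in the error
n^2 * 8 / 1024^4             # ~4 TiB for the full dense n x n matrix
```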

— Jean-Karim Heriche

I see, thanks for that. Then yes, that would be the correct data, since the 25 columns are just identifiers and we want to calculate distances between rows. I've been busy and haven't had a chance to run this on our computing cluster to see if it will work, but I will report back.

— cook.675