Question: Working with huge single-cell matrices in Seurat or monocle3
10 weeks ago, kchiou0 (United States) wrote:

I've been stumped by how to work with large (>1 million cell) datasets in Seurat or monocle3, both of which first convert their expression matrices into sparse matrices.

I'm currently working with a 14693 x 1093036 (gene x cell) matrix containing 3744232095 (>3.7 billion) nonzero values. I am finding that reading the matrix into R as a regular matrix works fine, but converting it into sparse format with Matrix::Matrix(x,sparse=TRUE) fails with the error "Error: cannot allocate vector of size 119.7 Gb".
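A quick arithmetic check on the numbers quoted above may help locate the failing allocation. This is only a sketch in Python (not part of the original post), assuming that R stores the matrix as 8-byte doubles and that the "Gb" in R's error message means 1024^3 bytes:

```python
# Dimensions and nonzero count quoted in the post (genes x cells).
n_genes = 14693
n_cells = 1093036
nnz = 3744232095

BYTES_PER_DOUBLE = 8  # assumption: values are held as 8-byte doubles
GIB = 1024 ** 3       # assumption: R's "Gb" is 1024^3 bytes

n_entries = n_genes * n_cells
dense_gib = n_entries * BYTES_PER_DOUBLE / GIB
density = nnz / n_entries

print(f"entries:   {n_entries:,}")    # entries:   16,059,977,948
print(f"dense GiB: {dense_gib:.1f}")  # dense GiB: 119.7
print(f"density:   {density:.1%}")    # density:   23.3%
```

Under those assumptions, the 119.7 Gb request in the error is the size of one full dense copy of the matrix, which suggests (but does not prove) that the conversion tries to allocate a second dense-sized vector. About 23% of the entries are nonzero.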

I next tried to convert this to sparse format by writing smaller pieces of the matrix to the hard disk in MatrixMarket (.mtx) format, combining them all outside of R (adjusting row indexes as necessary and writing a header), and then reading the result back in with Matrix::readMM('matrix.mtx'). The resulting file appears to be valid (it loads into Python with scipy.io.mmread()), but it fails to import into R with Matrix::readMM(), which now gives the error:

Error in validityMethod(as(object, superClass)) :
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:535
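For readers unfamiliar with the chunk-and-merge step described in the post, it might look roughly like the following. This is a minimal sketch, not the poster's actual script: the helper names (`read_size_line`, `merge_mtx_by_rows`) and file names are hypothetical, and it assumes each chunk is a MatrixMarket coordinate file of type "real general" whose chunks are stacked by rows and share the same column count.

```python
import os
import tempfile

HEADER = "%%MatrixMarket matrix coordinate real general"

def read_size_line(handle):
    """Skip header/comment lines and return (rows, cols, nnz)."""
    for line in handle:
        if not line.startswith("%"):
            rows, cols, nnz = (int(tok) for tok in line.split())
            return rows, cols, nnz
    raise ValueError("no size line found")

def merge_mtx_by_rows(chunk_paths, out_path):
    """Stack coordinate-format chunks vertically (hypothetical helper)."""
    # First pass: read only the size lines to build the combined header.
    sizes = []
    for path in chunk_paths:
        with open(path) as fh:
            sizes.append(read_size_line(fh))
    if len({cols for _, cols, _ in sizes}) != 1:
        raise ValueError("chunks must share the same number of columns")
    total_rows = sum(r for r, _, _ in sizes)
    total_nnz = sum(n for _, _, n in sizes)
    n_cols = sizes[0][1]

    # Second pass: stream entries, shifting each chunk's 1-based row
    # indexes by the number of rows already written.
    with open(out_path, "w") as out:
        out.write(f"{HEADER}\n{total_rows} {n_cols} {total_nnz}\n")
        row_offset = 0
        for path, (rows, _, _) in zip(chunk_paths, sizes):
            with open(path) as fh:
                read_size_line(fh)  # advance past header and size line
                for line in fh:
                    if not line.strip():
                        continue
                    i, j, value = line.split()
                    out.write(f"{int(i) + row_offset} {j} {value}\n")
            row_offset += rows
    return total_rows, n_cols, total_nnz

# Tiny demonstration with two made-up chunks (2 x 3 and 1 x 3).
tmp = tempfile.mkdtemp()
chunk_a = os.path.join(tmp, "chunk_a.mtx")
chunk_b = os.path.join(tmp, "chunk_b.mtx")
with open(chunk_a, "w") as fh:
    fh.write(f"{HEADER}\n2 3 2\n1 1 5\n2 3 7\n")
with open(chunk_b, "w") as fh:
    fh.write(f"{HEADER}\n1 3 1\n1 2 9\n")
merged = os.path.join(tmp, "matrix.mtx")
shape = merge_mtx_by_rows([chunk_a, chunk_b], merged)
with open(merged) as fh:
    merged_lines = fh.read().splitlines()
print(shape)  # (3, 3, 3)
```

Streaming line by line keeps memory use small regardless of file size. Merging the chunks does not change the total number of nonzero entries, so it would not by itself get around a limit on entry count in the reader.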

I've tried running this on our university HPC with 2 Tb of memory and tried maximizing the vector heap minimum (--min-vsize) and I still get these errors. Am I hitting the limits for vector storage in R? I don't see any way of proceeding with workflows in Seurat or monocle3 without getting past this issue of huge matrices. Any help or advice would be appreciated!
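On the question of limits, one number is worth checking. R's integer type is 32-bit, so `.Machine$integer.max` is 2^31 - 1. The Matrix package's compressed sparse classes such as dgCMatrix keep their column pointers in an integer vector, so the nonzero count has to fit in that range. The thread does not confirm this as the cause, but it would be consistent with the "long vectors not supported" error. The sketch below, in Python for the arithmetic only, compares the quoted nonzero count with that limit. The per-block estimate assumes nonzeros are spread evenly across cells, which real data will not satisfy exactly.

```python
import math

INT32_MAX = 2**31 - 1  # same value as R's .Machine$integer.max
nnz = 3744232095       # nonzero count quoted in the post
n_cells = 1093036      # column (cell) count quoted in the post

exceeds_limit = nnz > INT32_MAX
ratio = nnz / INT32_MAX

# Lower bound on the number of column blocks needed so that each
# block holds at most INT32_MAX nonzero entries.
min_blocks = math.ceil(nnz / INT32_MAX)

# Assumption: nonzeros are evenly distributed across cells.
max_cells_per_block = INT32_MAX * n_cells // nnz

print(exceeds_limit)       # True
print(f"{ratio:.2f}")      # 1.74
print(min_blocks)          # 2
print(max_cells_per_block)
```

So the matrix holds roughly 1.74 times as many nonzeros as a 32-bit index can address. Any approach that splits the cells into separate sparse matrices would need at least two blocks. Extra RAM would not help here, since the limit is on index width, not memory.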

Tags: seurat, rna-seq, monocle3, R

How much memory are you allocating yourself from the HPC? The memory error is telling you that it can't allocate that memory on top of what is already being used by R.

written 10 weeks ago by rpolicastro2.3k

I've replicated the memory error even when allocating an entire node with 2 Tb of memory, so the hardware seems to be sufficient.

written 10 weeks ago by kchiou0

Do you mind posting some more information? Specifically, the size of the matrix as reported by print(object.size(mat), units="Gb"), and your memory allocation as reported by free -h on the Linux command line.

modified 10 weeks ago • written 10 weeks ago by rpolicastro2.3k

Thanks for following up! I just followed your instructions on a 248 Gb node (our larger nodes are not free at the moment).

The object size of the gene x cell matrix in R is 126.6 Gb.

And here is the output of free -h:

              total        used        free      shared  buff/cache   available
Mem:           251G        2.2G        246G        211M        3.1G        248G
Swap:          4.0G        2.2G        1.8G
written 10 weeks ago by kchiou0
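The point about the allocation landing on top of what R already holds can be checked against these numbers for the 248 Gb node. The following is a rough sketch in Python. It assumes the units reported by object.size(), by R's error message, and by free -h are all 1024-based, and it ignores any other memory R or the system needs.

```python
matrix_gib = 126.6     # object.size() result quoted in the reply
request_gib = 119.7    # allocation size from the error message
available_gib = 248.0  # "available" column of the free -h output

needed_gib = matrix_gib + request_gib
headroom_gib = available_gib - needed_gib

print(f"needed:   {needed_gib:.1f}")    # needed:   246.3
print(f"headroom: {headroom_gib:.1f}")  # headroom: 1.7
```

With less than 2 Gb of headroom, the allocation failure on this particular node is unsurprising. This arithmetic says nothing about the failure reported on the 2 Tb node.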