Working with huge single-cell matrices in Seurat or monocle3
9 months ago
kchiou • 0

I've been stumped with how to work with large (>1 million cell) datasets in Seurat or monocle3, both of which first convert their expression matrices into sparse matrices.

I'm currently working with a 14693 x 1093036 (gene x cell) matrix containing 3744232095 (>3.7 billion) nonzero values. Reading the matrix into R as a regular dense matrix works fine, but converting it to sparse format with Matrix::Matrix(x, sparse=TRUE) fails with the error "Error: cannot allocate vector of size 119.7 Gb".
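For what it's worth, the size in that allocation error matches one extra dense copy of the matrix exactly, which suggests the conversion duplicates the dense input before compressing it; a quick back-of-the-envelope check:

```r
# "cannot allocate vector of size 119.7 Gb" is the footprint of one more
# dense double (8-byte) copy of the 14693 x 1093036 matrix:
14693 * 1093036 * 8 / 1024^3
# ≈ 119.66 GiB, i.e. the "119.7 Gb" in the error message
```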

I next tried to convert to sparse format by writing smaller pieces of the matrix to disk in MatrixMarket (.mtx) format, combining them outside of R (adjusting row indices as necessary and writing a header), and then reading the result back in with readMM('matrix.mtx'). The resulting sparse matrix file is valid (it loads into Python with scipy.io.mmread()), but it fails to import into R with Matrix::readMM(), which now gives the error:

Error in validityMethod(as(object, superClass)) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:535
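If it helps to see where the hard limit sits: as far as I understand the Matrix internals, a dgCMatrix keeps its row indices in a 32-bit integer slot, so a nonzero count this large is over the representable range regardless of how much RAM is available:

```r
# The dgCMatrix row-index slot is a 32-bit integer vector, so the number
# of nonzeros cannot exceed .Machine$integer.max.
nnz <- 3744232095           # nonzeros in the matrix above
.Machine$integer.max        # 2147483647, the 32-bit index ceiling
nnz > .Machine$integer.max  # TRUE: ~3.7e9 nonzeros cannot be indexed
```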


I've tried running this on our university HPC on a node with 2 TB of memory, and I've tried raising the minimum vector heap size (--min-vsize), but I still get these errors. Am I hitting the limits for vector storage in R? I don't see any way of proceeding with workflows in Seurat or monocle3 without getting past this issue of huge matrices. Any help or advice would be appreciated!
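In case it's useful, the chunked export I described above looks roughly like this (a sketch only; `write_mtx_chunks`, the chunk size, and the file naming are mine, not from any package):

```r
library(Matrix)

# Sketch of the chunked export: write a dense gene x cell matrix out as a
# series of sparse MatrixMarket files, `chunk` rows (genes) at a time, so
# each piece individually stays under the sparse-conversion limits.
write_mtx_chunks <- function(mat, chunk = 2000, prefix = "chunk") {
  for (start in seq(1, nrow(mat), by = chunk)) {
    end <- min(start + chunk - 1, nrow(mat))
    writeMM(Matrix(mat[start:end, , drop = FALSE], sparse = TRUE),
            sprintf("%s_%07d.mtx", prefix, start))
  }
}
```

Each piece's row indices then still need to be offset by its starting row, and a combined header written, when the files are concatenated outside of R.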

R rna-seq seurat monocle3 • 950 views

How much memory are you requesting for yourself from the HPC scheduler? The memory error is telling you that R cannot allocate that much on top of what it is already using.


I've replicated the memory error even when allocated an entire node with 2 TB of memory, so the hardware does not seem to be the limit.


Do you mind posting some more information? The size of the matrix from print(object.size(mat), units="Gb"), and your memory allocation from free -h on the Linux command line.


Thanks for following up! I just ran those checks on a 248 GB node (our larger nodes are not free at the moment).

The object size of the gene x cell matrix in R is 126.6 Gb

And here is the output of free -h:

              total        used        free      shared  buff/cache   available
Mem:           251G        2.2G        246G        211M        3.1G        248G
Swap:          4.0G        2.2G        1.8G