I'm stumped on how to work with large (>1 million cell) datasets in Seurat or monocle3, both of which first convert their expression matrices into sparse matrices.
I'm currently working with a 14693 x 1093036 (gene x cell) matrix containing 3744232095 (>3.7 billion) nonzero values. Reading the matrix into R as a regular dense matrix works fine, but converting it to sparse format with `Matrix::Matrix(x, sparse = TRUE)` fails with the error `Error: cannot allocate vector of size 119.7 Gb`.
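For reference, a minimal sketch of the call pattern (the matrix here is a small random stand-in for the real dense object; names are placeholders):

```r
library(Matrix)

# Small stand-in for the real 14693 x 1093036 dense gene x cell matrix
x <- matrix(rpois(14693 * 100, lambda = 0.25), nrow = 14693)

# On the toy matrix this works; on the full-size matrix this is the
# call that fails with "Error: cannot allocate vector of size 119.7 Gb"
xs <- Matrix(x, sparse = TRUE)
```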
I next tried to convert to sparse format by writing smaller pieces of the matrix to the hard disk in MatrixMarket (.mtx) format, combining them all outside of R (adjusting row indexes as necessary and writing a header; roughly sketched below), and then reading it back in with `readMM('matrix.mtx')`. The resulting sparse matrix seems valid (it loads into Python with `scipy.io.mmread()`), but it fails to import into R with `Matrix::readMM()`, now giving the error:
```
Error in validityMethod(as(object, superClass)) :
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:535
```
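In case the way I split things matters, here's roughly how I wrote the pieces (the chunk size, file names, and the toy matrix are placeholders; the concatenation and header rewriting happened outside of R):

```r
library(Matrix)

# Toy stand-in for the dense matrix; for the real data each row
# block's nonzero count stays well below 2^31 - 1
x <- matrix(rpois(14693 * 100, lambda = 0.25), nrow = 14693)

chunk_rows <- 1000  # placeholder block size
starts <- seq(1, nrow(x), by = chunk_rows)

for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk_rows - 1, nrow(x))
  writeMM(Matrix(x[rows, , drop = FALSE], sparse = TRUE),
          sprintf("chunk_%03d.mtx", i))
}
# Outside of R: concatenate the chunks' entry lists, offset each
# chunk's row indices by (its starting row - 1), and write a single
# MatrixMarket header with the combined dimensions and total nonzeros.
```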
I've tried running this on our university HPC with 2 TB of memory and tried raising the minimum vector heap size (`--min-vsize`), and I still get these errors. Am I hitting the limits for vector storage in R? I don't see any way of proceeding with workflows in Seurat or monocle3 without getting past this issue of huge matrices. Any help or advice would be appreciated!
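In case it's relevant to the "long vectors" error: the nonzero count alone already exceeds R's maximum integer, which, as I understand it (my assumption, not something I've confirmed in the Matrix docs), is the limit for the integer index slots of Matrix's dgCMatrix class. A quick check:

```r
# The number of nonzero values exceeds R's maximum integer, which
# (as I understand it) caps the index slots of Matrix's dgCMatrix class
nnz <- 3744232095
.Machine$integer.max        # 2147483647 (2^31 - 1)
nnz > .Machine$integer.max  # TRUE
```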