4.2 years ago
Bogdan
Dear all,
I would like to ask for your suggestions about the computing resources needed for scRNA-seq analysis of 12 (or more) samples (each sample has 5000-6000 cells). We could use the Seurat, Liger, Harmony, or SimpleSingleCell pipelines.
What would be the minimum RAM needed (64GB is not sufficient)?
And would you recommend doing all the data processing on Google Cloud, AWS, or another platform?
thank you !
bogdan
I have run more cells than that with less than 64GB, so that part may be hard to predict based on the number of cells alone.
Thank you very much, Igor, for sharing your experience with scRNA-seq pipelines :)
Are you talking about the initial preprocessing (like 10X's CellRanger) or downstream analysis? Which tool did you use that didn't like the 64G limit? I know that CellRanger by default tries to use 90% of all memory on the machine and therefore causes trouble; you have to explicitly specify the amount of memory it can use. In my experience, most single-cell pipelines do not need that much memory, but you have to be careful to set the limit explicitly, as in the CellRanger example.
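To make the memory cap concrete, here is a minimal sketch of building a `cellranger count` call with explicit limits. The `--localmem` (in GB) and `--localcores` options are the ones Cell Ranger provides for this; the run ID, paths, sample name, and the 48 GB / 8 core values are placeholders I made up for illustration, and the helper function name is hypothetical.

```python
import shlex


def cellranger_count_cmd(run_id, transcriptome, fastqs, sample, mem_gb=48, cores=8):
    """Build a `cellranger count` command with explicit memory and core limits.

    All argument values are supplied by the caller; the defaults for
    mem_gb and cores are illustrative assumptions, not recommendations.
    """
    return [
        "cellranger", "count",
        f"--id={run_id}",
        f"--transcriptome={transcriptome}",
        f"--fastqs={fastqs}",
        f"--sample={sample}",
        f"--localmem={mem_gb}",    # cap memory (GB) instead of the 90% default
        f"--localcores={cores}",   # cap the number of cores as well
    ]


# Placeholder paths and names, for illustration only.
cmd = cellranger_count_cmd("sample01", "/path/to/refdata", "/path/to/fastqs", "sample01")
print(shlex.join(cmd))
```

The printed string can be pasted into a shell or a SLURM batch script; passing the list to `subprocess.run(cmd, check=True)` would run it directly.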
Yes, thank you for asking for more details. We typically use CellRanger on a SLURM cluster, and we have lots of resources there.
After we obtain the count matrices for all the samples, I have been prototyping the scRNA-seq pipeline on my Ubuntu workstation (which has 64GB RAM). The pipeline prototype consists of Seurat 3, with Conos, Liger, and Harmony for batch correction (according to: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1850-9). Thanks :)
This is difficult to say since it depends heavily on the application. I analyzed my four 10X samples (5000-7000 cells each) on a MacBook Pro with 16GB RAM and had no problems. If you have a lot of samples you might indeed need more RAM, since the complexity of some algorithms scales quadratically or even cubically. At which step did you run out of memory?
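As a rough back-of-the-envelope illustration of the quadratic scaling mentioned above, the sketch below compares the size of a dense genes-by-cells matrix with a dense cell-by-cell matrix for the 12 samples in the question. The gene count (20,000) and 8-byte values are my assumptions, and real tools mostly keep counts sparse, so treat these as upper-bound estimates rather than what any particular package allocates.

```python
def dense_gib(rows, cols, bytes_per_value=8):
    """Size in GiB of a dense rows x cols matrix of fixed-width values."""
    return rows * cols * bytes_per_value / 2**30


n_samples = 12
cells_per_sample = 6000   # upper end of the range in the question
n_genes = 20000           # assumption: typical order of magnitude, not from the thread

n_cells = n_samples * cells_per_sample

# Grows linearly with the number of cells.
expression_gib = dense_gib(n_cells, n_genes)
# Grows quadratically with the number of cells.
pairwise_gib = dense_gib(n_cells, n_cells)

print(f"{n_cells} cells in total")
print(f"dense expression matrix:   {expression_gib:.1f} GiB")
print(f"dense cell-by-cell matrix: {pairwise_gib:.1f} GiB")
```

Doubling the number of cells doubles the first figure but quadruples the second, which is why a step that worked for four samples can fail for twelve.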
Thank you for sharing your experience. I remember the step at which the computer signaled that it had run out of memory, although I will re-run it this coming week and will let you know.
Just a note to add: the pipeline suggested by the authors (https://satijalab.org/seurat/v3.1/integration.html), which uses a reference-based approach, seems to be working well. To quote:
"we present an additional modification to the Seurat integration workflow, which we refer to as ‘Reference-based’ integration. In the previous workflows, we identify anchors between all pairs of datasets. While this gives datasets equal weight in downstream integration, it can also become computationally intensive. For example, when integrating 10 different datasets, we perform 45 different pairwise comparisons.
As an alternative, we introduce here the possibility of specifying one or more of the datasets as the ‘reference’ for integrated analysis, with the remainder designated as ‘query’ datasets. In this workflow, we do not identify anchors between pairs of query datasets, reducing the number of comparisons. For example, when integrating 10 datasets with one specified as a reference, we perform only 9 comparisons."
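The counts in the quote are easy to reproduce, and the same arithmetic applies to the 12 samples in the question. This is only a sketch of the counting, not Seurat code; the function names are hypothetical, and the formula for more than one reference (all reference pairs plus every query against every reference) is my assumption, chosen so that it matches the quoted 45 and 9 for 10 datasets.

```python
from math import comb


def all_pairs_comparisons(n_datasets):
    """Comparisons when anchors are found between every pair of datasets."""
    return comb(n_datasets, 2)


def reference_based_comparisons(n_datasets, n_references=1):
    """Comparisons when query datasets are only compared against references.

    Assumption: references are compared pairwise with each other, and each
    query is compared with each reference, but never with another query.
    """
    if not 1 <= n_references <= n_datasets:
        raise ValueError("n_references must be between 1 and n_datasets")
    n_queries = n_datasets - n_references
    return comb(n_references, 2) + n_references * n_queries


for n in (10, 12):
    print(n, "datasets:",
          all_pairs_comparisons(n), "all-pairs vs",
          reference_based_comparisons(n), "with one reference")
```

The all-pairs count grows quadratically with the number of samples while the single-reference count grows linearly, so the saving gets larger as more samples are added.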