Velocyto for Smart-seq2 taking an excessively long time (and too much memory)?
6 weeks ago
skjw1029 ▴ 30

I'm trying to run velocyto with the run-smartseq2 command on about 1500 BAM files produced by subread-align.

The problem is that the job took forever and then failed. It ran for 14 hours before, I think, it exhausted the memory on the compute node (I'm using an HPC node with 128 GB RAM). The end of the log looks like this:

2021-08-03 08:40:40,913 - DEBUG - Reading /endosome/work/InternalMedicine/s184335/genome.med.nyu.edu/results/external/parklab/2018-05-09-WCMC/allfastq_files/MDS_trimmed_fqfiles/subread_aligned/C2-G12_S322_L006.bam
2021-08-03 08:45:20,476 - DEBUG - Counting for batch 69, containing 1 cells and 8673372 reads
2021-08-03 08:52:02,538 - DEBUG - 4299289 reads overlapping with features on plus strand
2021-08-03 08:52:02,538 - DEBUG - 4150071 reads overlapping with features on minus strand
2021-08-03 08:52:02,538 - DEBUG - 984169 reads overlapping with features on both strands
2021-08-03 08:54:12,717 - WARNING - The barcode selection mode is off, no cell events will be identified by <80 counts
2021-08-03 08:54:12,718 - WARNING - 0 of the barcodes where without cell
slurmstepd: error: get_exit_code task 0 died by signal


The velocyto manual says that running "a typical sample" should take about 6 hours.

I've tried re-running with a subset of about 40 BAM files (each BAM file is one cell), but the pace seems to be about the same: it has been running for over 3 hours and has not completed at the time of writing.

Looking at the log file, the vast majority of the time seems to be spent counting the reads in the BAM files (see the log output above).

The command I've run is this:

velocyto run-smartseq2 -o test_MDS_RNAvelocity -m ../hg38_rmsk.gtf -e MDS_HSC_RNAvelocity *.bam /endosome/work/InternalMedicine/s184335/genome_folder/alias/hg38/ensembl_gtf/default/hg38.gtf


Has anyone used velocyto for smart-seq2 data and experienced this sort of problem? Is this amount of time and resources used by velocyto normal? Surely 1589 cells shouldn't take this long to process?

Would there be any way to make it more efficient?
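
One approach that might bound the memory and wall time of each job is to split the BAM files into batches, run velocyto once per batch as separate SLURM jobs, and merge the resulting loom files afterwards. The following is only a minimal sketch of that idea, not something confirmed to fix the problem in this thread. The helper names (chunk, build_commands, merge_looms), the batch size of 100, the "_batchNNN" sample-id suffix and the "hg38.gtf" placeholder path are all assumptions made for illustration. It uses only the -o, -m and -e options from the command above, and it assumes the per-batch loom files share an "Accession" row attribute so that loompy.combine can align genes.

```python
import shlex
from pathlib import Path


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_commands(bam_files, gtf, repmask, out_dir, sample_prefix, batch_size=100):
    """Return one `velocyto run-smartseq2` argument list per batch of BAM files."""
    commands = []
    for n, batch in enumerate(chunk(sorted(bam_files), batch_size)):
        commands.append(
            ["velocyto", "run-smartseq2",
             "-o", out_dir,
             "-m", repmask,
             # hypothetical naming scheme: one sample id (and loom file) per batch
             "-e", f"{sample_prefix}_batch{n:03d}",
             *batch,
             gtf]
        )
    return commands


def merge_looms(loom_files, merged_path):
    """Concatenate per-batch loom files by cells, aligning genes on 'Accession'."""
    import loompy  # imported lazily so the batching helpers work without loompy
    loompy.combine(loom_files, merged_path, key="Accession")


if __name__ == "__main__":
    bams = [str(p) for p in Path(".").glob("*.bam")]
    # "hg38.gtf" is a placeholder; substitute the real annotation path
    for cmd in build_commands(bams, "hg38.gtf", "../hg38_rmsk.gtf",
                              "test_MDS_RNAvelocity", "MDS_HSC_RNAvelocity"):
        print(shlex.join(cmd))  # one line per batch, e.g. for separate SLURM jobs
```

Each printed line could go into its own job script. Once all batches have finished, merge_looms would be called with the list of per-batch loom files from the output folder.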

Edit:

The log says that the program will determine the cell barcodes while reading the BAM file, which might be the problem. However, this is Smart-seq2 data, and the run-smartseq2 command does not have an option for specifying a barcode set.

2021-08-02 22:28:20,899 - WARNING - Each bam file will be interpreted as a DIFFERENT cell
2021-08-02 22:28:20,900 - DEBUG - Using logic: SmartSeq2
2021-08-02 22:28:20,900 - DEBUG - Cell barcodes will be determined while reading the .bam file


Later on, the program also says that the barcode selection mode is off.

Is there something wrong with the command or an option I'm forgetting to pass?

velocyto slurm RNAvelocity smart-seq2 • 274 views

Please do not paste screenshots of plain-text content; it is counterproductive. You can copy and paste the content directly here (using the code formatting option shown below), or use a GitHub Gist if the content exceeds the length allowed here.

6 weeks ago

Velocyto used to take a long time to run and used a lot of memory for me. More recent software such as STARsolo or alevin-fry (Salmon) can generate the same spliced/unspliced/ambiguous count matrices with a fraction of the time and resources, so I switched to them.

Once you have generated the counts, I also highly recommend scVelo from the Theis lab instead of the velocyto software for the downstream analysis.


STARsolo and alevin-fry both seem to be aimed primarily at 10x and other droplet-based single-cell data.

With STARsolo, can you feed in already-aligned BAM files from Smart-seq2 data when using --soloFeatures Gene Velocyto? The manual is a bit vague about which genome files you need to pass, and about how to pass pre-aligned BAM files, if that is possible at all.