I'm trying to run velocyto with the run-smartseq2 command on about 1500 bam files from alignment with
The problem is that the job takes forever and then fails. It ran for 14 hours before, I think, it exhausted the memory on the compute node (I'm using an HPC node with 128 GB RAM).
2021-08-03 08:40:40,913 - DEBUG - Reading /endosome/work/InternalMedicine/s184335/genome.med.nyu.edu/results/external/parklab/2018-05-09-WCMC/allfastq_files/MDS_trimmed_fqfiles/subread_aligned/C2-G12_S322_L006.bam
2021-08-03 08:40:40,999 - DEBUG - Read first 0 million reads
2021-08-03 08:45:20,476 - DEBUG - Counting for batch 69, containing 1 cells and 8673372 reads
2021-08-03 08:52:02,537 - DEBUG - 1110320 reads in repeat masked regions
2021-08-03 08:52:02,538 - DEBUG - 4299289 reads overlapping with features on plus strand
2021-08-03 08:52:02,538 - DEBUG - 4150071 reads overlapping with features on minus strand
2021-08-03 08:52:02,538 - DEBUG - 984169 reads overlapping with features on both strands
2021-08-03 08:54:12,717 - WARNING - The barcode selection mode is off, no cell events will be identified by <80 counts
2021-08-03 08:54:12,718 - WARNING - 0 of the barcodes where without cell
2021-08-03 08:54:15,469 - DEBUG - Reading /endosome/work/InternalMedicine/s184335/genome.med.nyu.edu/results/external/parklab/2018-05-09-WCMC/allfastq_files/MDS_trimmed_fqfiles/subread_aligned/C2-H02_S321_L006.bam
2021-08-03 08:54:15,572 - DEBUG - Read first 0 million reads
slurmstepd: error: get_exit_code task 0 died by signal
The velocyto manual says that running "a typical sample" should take about 6 hours.
I've tried re-running with a subset of about 40 bam files (each bam file is a cell), but the pace seems to be about the same; it hasn't completed at the time of writing and has been running for over 3 hours.
Looking at the log file, the vast majority of the time seems to be taken up by counting the reads in each bam file (see the log output above).
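As a rough check on the pace, here is a minimal Python sketch that takes the two "Reading ..." timestamps from the log output above and extrapolates to all cells. It assumes every cell takes about as long as the one shown, which may not hold, so treat the result as an order-of-magnitude estimate only.

```python
from datetime import datetime

# Timestamps of two consecutive "Reading ..." lines copied from the log above.
# The gap between them is the time spent on a single cell (one bam file).
FMT = "%Y-%m-%d %H:%M:%S,%f"
start = datetime.strptime("2021-08-03 08:40:40,913", FMT)
end = datetime.strptime("2021-08-03 08:54:15,469", FMT)

seconds_per_cell = (end - start).total_seconds()
n_cells = 1589  # number of cells in the full run

# Assumption: every cell takes roughly as long as this one.
total_hours = seconds_per_cell * n_cells / 3600

print(f"{seconds_per_cell / 60:.1f} min per cell, "
      f"about {total_hours:.0f} h for {n_cells} cells")
```

At this rate the full run would need hundreds of hours, far beyond the 6 hours quoted in the manual.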
The command I've run is this:
velocyto run-smartseq2 -o test_MDS_RNAvelocity -m ../hg38_rmsk.gtf -e MDS_HSC_RNAvelocity *.bam /endosome/work/InternalMedicine/s184335/genome_folder/alias/hg38/ensembl_gtf/default/hg38.gtf
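For reference, a minimal sketch of how the same command could be split into smaller runs, like the 40-file subset mentioned above. It only reuses the flags from my command (-o, -m, -e); the function name batched_commands, the batch size, and the numbered output/sample names are hypothetical. Whether smaller runs actually lower memory use or run time is an assumption I have not verified, and the per-batch outputs would still need to be combined afterwards (not shown).

```python
import shlex
from pathlib import Path


def batched_commands(bams, gtf, mask, out_prefix, sample_prefix, batch_size=40):
    """Build one `velocyto run-smartseq2` argument list per batch of bam files.

    Hypothetical helper: it only assembles commands, it does not run them.
    """
    commands = []
    for i in range(0, len(bams), batch_size):
        n = i // batch_size  # batch index, used to keep outputs separate
        commands.append([
            "velocyto", "run-smartseq2",
            "-o", f"{out_prefix}_{n:03d}",
            "-m", mask,
            "-e", f"{sample_prefix}_{n:03d}",
            *bams[i:i + batch_size],
            gtf,
        ])
    return commands


# Print the commands for all bam files in the current directory,
# using the paths from the command above.
bam_files = sorted(str(p) for p in Path(".").glob("*.bam"))
for cmd in batched_commands(
    bam_files,
    gtf="/endosome/work/InternalMedicine/s184335/genome_folder/alias/hg38/ensembl_gtf/default/hg38.gtf",
    mask="../hg38_rmsk.gtf",
    out_prefix="test_MDS_RNAvelocity",
    sample_prefix="MDS_HSC_RNAvelocity",
):
    print(shlex.join(cmd))
```

Each printed line could then be submitted as its own job on the cluster.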
Has anyone used velocyto for smart-seq2 data and experienced this sort of problem?
Is this amount of time and resources used by velocyto normal? Surely 1589 cells shouldn't take this long to process?
Would there be any way to make it more efficient?
The log (below) says that the cell barcodes will be determined while reading the bam file, which might be the problem, but this is smart-seq2 data and the run-smartseq2 command does not have an option for specifying a barcode set.
2021-08-02 22:28:20,899 - WARNING - Each bam file will be interpreted as a DIFFERENT cell
2021-08-02 22:28:20,900 - DEBUG - Using logic: SmartSeq2
2021-08-02 22:28:20,900 - DEBUG - Cell barcodes will be determined while reading the .bam file
Also later on, the program says that the barcode selection mode is off.
Is there something wrong with the command or an option I'm forgetting to pass?