As the title says, Discovar De Novo (52488 - I think this is some version identifier) keeps saying that it can't allocate memory and then reliably aborts. This is driving me up the wall because I often queue for days for access to a 1 TB compute node on the HPC.
The details of my sequencing data:
A single PE library (Illumina HiSeq 2500, 2x250 bp, 500 bp insert). Originally ~ 120x depth but I have tried subsampling this to 50% of that and get the same error. Genome size is estimated at ~ 300 Mb.
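As a rough sanity check on these numbers, here is a back-of-the-envelope estimate of how many read pairs each depth implies. It is only a sketch: it assumes every read is exactly 250 bp and takes the ~ 300 Mb genome size estimate at face value, and the function name is just something I made up for illustration.

```python
def read_pairs_for_depth(genome_size_bp, depth, read_length_bp=250):
    """Approximate number of read pairs needed for a given depth.

    Assumes paired-end reads of uniform length, so each pair
    contributes 2 * read_length_bp bases.
    """
    total_bases = genome_size_bp * depth
    return total_bases // (2 * read_length_bp)


GENOME_SIZE_BP = 300_000_000  # ~300 Mb estimate (an assumption, not a measurement)

full_pairs = read_pairs_for_depth(GENOME_SIZE_BP, 120)  # original library
half_pairs = read_pairs_for_depth(GENOME_SIZE_BP, 60)   # after 50% subsampling

print(f"~120x: {full_pairs:,} read pairs")
print(f"~60x:  {half_pairs:,} read pairs")
```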
The node I'm running on:
hardware type: x86_64
cache size: 35840 KB
cpu MHz: 2400.000
cpu model name: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
physical memory: 1007.57 GB
The invocation I last used (this is the 50% subsampling I mentioned above):
DiscovarDeNovo READS=/scratch/genomicsocorg/mwhj1/Assemblies_2MPs/SC1702273-R3/20_50pc_R1.fastq,/scratch/genomicsocorg/mwhj1//Assemblies_2MPs/SC1702273-R3/20_50pc_R2.fastq OUT_DIR=/scratch/genomicsocorg/mwhj1/Assemblies_2MPs/Test4_20 MAX_MEM_GB=900 NUM_THREADS=28
The error message (this has popped up at different points during assembly and is always the same):
"Dang dang dang, we've got a problem. Attempt to allocate memory failed, memory usage before call = 38.87 GB."
Further up in the output log I can see that peak memory usage reliably reaches the 500 GB range during steps before this error appears.
Discovar De Novo suggested the following solutions:
1. Run without other competing processes (if that's the problem).
2. Run on a server having more memory, or reduce your input data amount.
3. Consider using the MAX_MEM_GB or MEMORY_CHECK options (if available).
I don't think 1 is an issue (but details of the top memory processes on the node, at the time of Discovar De Novo giving up are below) - IT services here agree that this is not the issue.
2 is not an option here, as I am already using our highest-memory (1 TB) nodes.
3 I have tried, and it does not seem to help.
Top memory processes on node, as reported in the output log, at the time of failure (I think all of these are from Discovar De Novo):
our_new_handler(), in RunTime.cc:586
__gnu_cxx::new_allocator<kmerrecord<200> >::allocate(...), in new_allocator.h:104
_Vector_base<kmerrecord<200>, allocator<kmerrecord<200> > >::_M_allocate(...), in stl_vector.h:168
void vector<kmerrecord<200>, allocator<kmerrecord<200> > >::_M_emplace_back_aux<kmerrecord<200> const&>(...), in vector.tcc:404
vector<kmerrecord<200>, allocator<kmerrecord<200> > >::push_back(...), in stl_vector.h:911
vec<kmerrecord<200>, allocator<kmerrecord<200> > >::push_back(...), in Vec.h:153
KmerParcelVec<200ul>::ParseReadKmersForParcelIDs(...), in KmerParcelsBuilder.cc:331
KmerParcelVec<200ul>::RunNextTask(...), in KmerParcelsBuilder.cc:408
KmerParcelVecVec<200ul>::RunTasks(...), in KmerParcelsBuilder.cc:516
ParcelProcessor<200ul>::operator()(unsigned long), in KmerParcelsBuilder.h:258
void KmerParcelsBuilder::BuildTemplate<200ul>(), in KmerParcelsBuilder.cc:578
KmerParcelsBuilder::Build(unsigned long), in KmerParcelsBuilder.cc:743 (discriminator 1)
void MakeAlignsPathsParallelX<2ul>(...), in MakeAlignsPathsParallelX.cc:210
base_vec_vec_to_mutmer_hits(...), in ReadsToPathsCoreX.cc:438 (discriminator 1)
ReadsToPathsCoreX(...), in ReadsToPathsCoreX.cc:743
ReadsToPathsCoreY(...), in ReadsToPathsCoreX.cc:796
Details of my trials and tribulations:
I have talked to IT services at my university about this issue. The first time it happened I had only requested 350 GB of memory from the scheduler; that job was allocated a 1 TB node and hit this error. I resubmitted it with a request for 1000 GB of memory and added Discovar De Novo's optional arguments MAX_MEM_GB (set to 1000) and MEMORY_CHECK, and got the same error.
IT services said they were confident that my scheduler options were securing an entire 1 TB node (the biggest we have here), and suggested leaving a bit of overhead in the MAX_MEM_GB option, so I submitted the assembly again with MAX_MEM_GB=960. Discovar De Novo checked the available memory, found it could only access 950 GB, reduced my 960 figure to 950, and then hit the same problem. I resubmitted with MAX_MEM_GB=900 (and without MEMORY_CHECK, because I was worried the assembler might increase this figure if it saw more memory available; maybe this was a mistake) and got the same error.
All of these attempts used the full library. As Discovar De Novo's documentation says that it is designed for ~ 60x, I subsampled my reads with seqtk to 50% of their original depth (using the same seed for both files to keep read pairs together) and submitted that as an assembly. Same error.
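To be explicit about what I mean by using the same seed on both files, here is a small Python sketch of the principle. It is not seqtk's actual implementation, and the toy reads and function names are invented for illustration. It assumes plain 4-line FASTQ records with R1 and R2 in the same order. Because both random number generators start from the same seed and make one draw per record, they make identical keep/drop decisions, so mates stay paired.

```python
import io
import random


def read_fastq(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped 4-line records)."""
    while True:
        record = tuple(handle.readline() for _ in range(4))
        if not record[0]:  # empty string means end of file
            return
        yield record


def subsample(handle, fraction, seed):
    """Keep each record with probability `fraction`, using a seeded RNG."""
    rng = random.Random(seed)
    return [rec for rec in read_fastq(handle) if rng.random() < fraction]


def make_fastq(mate, n):
    """Build a tiny in-memory FASTQ for one mate (toy data, hypothetical names)."""
    text = "".join(f"@read{i}/{mate}\nACGT\n+\nIIII\n" for i in range(n))
    return io.StringIO(text)


# Same fraction and same seed for both mates.
kept_r1 = subsample(make_fastq(1, 1000), 0.5, seed=100)
kept_r2 = subsample(make_fastq(2, 1000), 0.5, seed=100)

# Strip the trailing "/1" or "/2" and compare the surviving read names.
names_r1 = [rec[0].strip()[:-2] for rec in kept_r1]
names_r2 = [rec[0].strip()[:-2] for rec in kept_r2]
print(len(names_r1), "pairs kept; still in sync:", names_r1 == names_r2)
```

With different seeds the two files would keep different records and the pairing would be lost, which is why I used one seed for both.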
A plea for help from someone inexperienced:
Am I doing something stupid? Do I need to interleave my fastq files or provide any extra options? If anyone has any help, advice, or words of support I would be extremely grateful - this is my first big project for my PhD and I want to tear into it - I'm utterly stuck though. If nobody can help specifically with Discovar De Novo, is there another assembler which anyone can suggest which would be suitable for assembling a single Illumina library?
Great that you got the problem solved! For future reference, when you run out of memory during assembly, I suggest the following steps (in addition to normalization which you have already done):
All of these can greatly reduce the number of unique kmers in the dataset, which is directly related to the amount of memory needed for assembly. Current versions of BBMap have a suggested assembly workflow in bbmap/pipelines/assemblyPipeline.sh which specifies the best order of operations.
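As a toy illustration of the point about unique kmers, the sketch below counts distinct kmers in a handful of reads before and after adding a single read that carries one substitution error. The sequence, the value of k, and the function name are all made up for the example, and a Python set is nothing like the data structures a real assembler uses; the only point is that each error introduces up to k kmers that appear nowhere else in the data.

```python
def distinct_kmers(reads, k):
    """Return the set of distinct k-mers found in a list of reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


K = 5  # tiny k for the toy example; assemblers use much larger values
genome = "ACGTTGCATGGCCATAGCTAGGCTTACGGA"  # invented 30 bp sequence

# Two overlapping, error-free reads that cover the whole toy genome.
clean_reads = [genome[0:20], genome[10:30]]

# A copy of the first read with one substitution (G -> A at position 10).
erroneous_read = clean_reads[0][:10] + "A" + clean_reads[0][11:]
noisy_reads = clean_reads + [erroneous_read]

clean = distinct_kmers(clean_reads, K)
noisy = distinct_kmers(noisy_reads, K)

print("distinct kmers without the error:", len(clean))
print("distinct kmers with one erroneous read:", len(noisy))
print("kmers created by the error:", sorted(noisy - clean))
```

Removing or correcting reads like the erroneous one shrinks the set of distinct kmers back toward what the genome itself contains.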