Question: Discovar De Novo assembly - "Attempt to allocate memory failed"
maxwhjohn1988 wrote, 4 months ago:

As the title says, Discovar De Novo (build 52488, which I think is a version identifier) keeps reporting that it cannot allocate memory, and then reliably aborts. This is driving me up the wall, because I often queue for days for access to a 1 TB compute node on the HPC.

The details of my sequencing data:

A single PE library (Illumina HiSeq 2500, 2x250 bp, 500 bp insert). Coverage was originally ~120x, but I have tried subsampling to 50% of that and get the same error. Genome size is estimated at ~300 Mb.

The node I'm running on:

  • hardware type: x86_64

  • cache size: 35840 KB

  • cpu MHz: 2400.000

  • cpu model name: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz

  • physical memory: 1007.57 GB

The invocation I last used (this is the 50% subsampling I mentioned above):

DiscovarDeNovo READS=/scratch/genomicsocorg/mwhj1/Assemblies_2MPs/SC1702273-R3/20_50pc_R1.fastq,/scratch/genomicsocorg/mwhj1//Assemblies_2MPs/SC1702273-R3/20_50pc_R2.fastq OUT_DIR=/scratch/genomicsocorg/mwhj1/Assemblies_2MPs/Test4_20 MAX_MEM_GB=900 NUM_THREADS=28

The error message (this has popped up at different points during assembly and is always the same):

"Dang dang dang, we've got a problem. Attempt to allocate memory failed, memory usage before call = 38.87 GB."

Further up in the output log I can see that peak memory usage reliably reaches the 500 GB range during the steps before this error appears.

Discovar De Novo suggested the following solutions:

  1. Run without other competing processes (if that's the problem).

  2. Run on a server having more memory, or reduce your input data amount.

  3. Consider using the MAX_MEM_GB or MEMORY_CHECK options (if available).

I don't think (1) is an issue; the stack trace from the output log at the time Discovar De Novo gave up is below, and IT services here agree that this is not the problem.

(2) is not an option, as I am already using our highest-memory (1 TB) nodes.

(3) I have tried, and it does not seem to help.

Stack trace reported in the output log at the time of failure (I think all of these frames are from Discovar De Novo):

  0. our_new_handler(), in RunTime.cc:586

  1. __gnu_cxx::new_allocator<kmerrecord<200> >::allocate(...), in new_allocator.h:104

  2. _Vector_base<kmerrecord<200>, allocator<kmerrecord<200> > >::_M_allocate(...), in stl_vector.h:168

  3. void vector<kmerrecord<200>, allocator<kmerrecord<200> > >::_M_emplace_back_aux<kmerrecord<200> const&>(...), in vector.tcc:404

  4. vector<kmerrecord<200>, allocator<kmerrecord<200> > >::push_back(...), in stl_vector.h:911

  5. vec<kmerrecord<200>, allocator<kmerrecord<200> > >::push_back(...), in Vec.h:153

  6. KmerParcelVec<200ul>::ParseReadKmersForParcelIDs(...), in KmerParcelsBuilder.cc:331

  7. KmerParcelVec<200ul>::RunNextTask(...), in KmerParcelsBuilder.cc:408

  8. KmerParcelVecVec<200ul>::RunTasks(...), in KmerParcelsBuilder.cc:516

  9. ParcelProcessor<200ul>::operator()(unsigned long), in KmerParcelsBuilder.h:258

  10. void KmerParcelsBuilder::BuildTemplate<200ul>(), in KmerParcelsBuilder.cc:578

  11. KmerParcelsBuilder::Build(unsigned long), in KmerParcelsBuilder.cc:743 (discriminator 1)

  12. void MakeAlignsPathsParallelX<2ul>(...), in MakeAlignsPathsParallelX.cc:210

  13. base_vec_vec_to_mutmer_hits(...), in ReadsToPathsCoreX.cc:438 (discriminator 1)

  14. ReadsToPathsCoreX(...), in ReadsToPathsCoreX.cc:743

  15. ReadsToPathsCoreY(...), in ReadsToPathsCoreX.cc:796

Details of my trials and tribulations:

I have talked to IT services at my university about this issue. My attempts so far, in order:

  1. The first time it happened I had only requested 350 GB of memory from the scheduler; that job was allocated a 1 TB node and hit this issue.

  2. I resubmitted with a request for 1000 GB of memory and added Discovar De Novo's optional arguments MAX_MEM_GB=1000 and MEMORY_CHECK. Same issue.

  3. IT services said they were confident my scheduler options were securing an entire 1 TB node (the biggest we have) and suggested leaving some overhead in MAX_MEM_GB, so I resubmitted with MAX_MEM_GB=960. Discovar De Novo checked the available memory, found it could only access 950 GB, reduced my 960 figure to 950, and then hit the same problem.

  4. I resubmitted with MAX_MEM_GB=900 and no MEMORY_CHECK option (I was worried the assembler might raise the figure if it could see more memory available; maybe that was a mistake). Same error.

  5. All of the above attempts used the full library. Since Discovar De Novo's documentation says it is designed for ~60x coverage, I subsampled my reads with seqtk to 50% of their original depth (using the same seed for both files to keep read pairs together) and submitted that as an assembly. Same error.
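The seqtk subsampling step described above can be sketched as follows. File names and the seed value (42) are placeholders; the important point is that seqtk sample selects reads pseudo-randomly from the seed passed with -s, so using the identical seed on both files keeps the R1/R2 pairs in sync.

```shell
# Subsample both mates to 50% of reads; the identical seed (-s42 here)
# guarantees the same read pairs are selected from R1 and R2.
seqtk sample -s42 reads_R1.fastq 0.5 > 50pc_R1.fastq
seqtk sample -s42 reads_R2.fastq 0.5 > 50pc_R2.fastq
```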

A plea for help from someone inexperienced:

Am I doing something stupid? Do I need to interleave my FASTQ files or provide extra options? If anyone has any help, advice, or words of support, I would be extremely grateful. This is my first big project for my PhD and I want to tear into it, but I'm utterly stuck. If nobody can help specifically with Discovar De Novo, can anyone suggest another assembler suitable for assembling a single Illumina library?

modified 11 weeks ago • written 4 months ago by maxwhjohn1988
maxwhjohn1988 wrote, 11 weeks ago:

For anyone looking at this post who has the same problem: I have it fixed now.

I was pointed at a forked version of Discovar developed by a group at the Earlham Institute in Norwich. They wanted to use Discovar De Novo to assemble a wheat genome and it kept crashing, so they dug into the code and appear to have fixed this problem. I was a bit dubious at first, since I'm working with a smallish haploid genome whereas they were targeting a big hexaploid one, but fixing memory issues was the first thing they did, and I've had no problems using their fork.

https://github.com/bioinfologics/w2rap-contigger

http://bioinfologics.github.io/the-w2rap-contigger/

https://pdfs.semanticscholar.org/1d1e/3b1d6014dfbb4beb86c576cd85b5f7275150.pdf

written 11 weeks ago by maxwhjohn1988

Great that you got the problem solved! For future reference, when you run out of memory during assembly, I suggest the following steps (in addition to normalization which you have already done):

1) Adapter-trimming and quality-trimming.
2) Read merging.
3) Contaminant removal (particularly large contaminant organisms like human).
4) Error-correction.

All of these can greatly reduce the number of unique kmers in the dataset, which is directly related to the amount of memory needed for assembly. Current versions of BBMap have a suggested assembly workflow in bbmap/pipelines/assemblyPipeline.sh which specifies the best order of operations.
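Steps 1, 2, and 4 above can be sketched with BBMap's own tools. This is an illustrative outline, not the canonical recipe (that lives in bbmap/pipelines/assemblyPipeline.sh); file names and trimming parameters are placeholders, and the contaminant-removal step is omitted because it needs a reference for the suspected contaminant.

```shell
# 1) Adapter- and quality-trimming with BBDuk (parameters are illustrative)
bbduk.sh in1=R1.fastq in2=R2.fastq out1=trim_R1.fastq out2=trim_R2.fastq \
    ref=adapters ktrim=r k=23 mink=11 hdist=1 tbo tpe qtrim=rl trimq=10

# 2) Overlap-based merging of read pairs with BBMerge
bbmerge.sh in1=trim_R1.fastq in2=trim_R2.fastq out=merged.fastq \
    outu1=unmerged_R1.fastq outu2=unmerged_R2.fastq

# 4) K-mer error-correction with Tadpole
tadpole.sh in=merged.fastq out=ecc_merged.fastq mode=correct
```

Each step shrinks the set of unique k-mers (trimming removes adapter k-mers, merging collapses overlapping pairs, correction removes error k-mers), which is what cuts the assembler's memory footprint.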

modified 11 weeks ago • written 11 weeks ago by Brian Bushnell
genomax (United States) wrote, 4 months ago:

Are you able to get Discovar to run successfully at all (try the test data that came with Discovar De Novo)? Do you have exclusive access to this node when this error happens, i.e. is no one else running anything on it? (IT generally hates granting exclusive access; it's something like -x if you are using LSF.)

You could try running BBNorm from BBMap suite to reduce the complexity of the problem in an intelligent way.
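A BBNorm run along these lines would do the normalization suggested above (file names are placeholders; target= sets the desired depth, and min= discards reads whose k-mer depth is so low they are likely sequencing errors):

```shell
# Normalize paired reads to ~60x target depth; min=5 drops
# error-dominated reads rather than keeping them at low depth.
bbnorm.sh in1=R1.fastq in2=R2.fastq out1=norm_R1.fastq out2=norm_R2.fastq \
    target=60 min=5
```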

written 4 months ago by genomax

Hi genomax,

Thanks very much for your reply. Discovar De Novo has been run successfully here with other data, so I know it's not a case of a broken installation. IT services assured me that I was getting a whole node to myself with the scheduler options I'd supplied: I'd requested the full amount of RAM and the full number of threads for one of our highmem nodes, so there was no room for anything else at all.

I followed your recommendation of BBNorm, and normalised to 60x coverage - this then worked perfectly, and Discovar assembled my ~ 300 Mb genome in < 1.5 hours. Thank you so much for recommending this to me!

Max

written 4 months ago by maxwhjohn1988

An update -

As I said above, I followed genomax's advice and ran BBNorm, which created read files which I successfully assembled.

However, I have another dataset: the same organism, the same library prep, the same sequencing (the two samples went through the same lane on the same sequencer!).

I didn't try assembling it until I had the previous sample working. After the success via BBNorm, I performed exactly the same steps on the second dataset. It failed during assembly with exactly the same error message as in my original post. The only thing I could think of trying was to lower the number of threads from 28 to 8, thereby increasing the available memory per thread. Hopefully this will work; I will post here to confirm.
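The reduced-thread attempt described above would look something like this (read file and output directory names are placeholders for the second dataset):

```shell
# Same options as the earlier run, but NUM_THREADS lowered from 28 to 8
# so that each worker thread has more memory headroom.
DiscovarDeNovo READS=sample2_R1.fastq,sample2_R2.fastq \
    OUT_DIR=assembly_sample2 MAX_MEM_GB=900 NUM_THREADS=8
```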

written 4 months ago by maxwhjohn1988

the same organism, the same library prep, the same sequencing

Then the thing to try would be to BBNorm both of them together and see if you are able to complete the run successfully.

they were sequenced in the same lane on the same sequencer!

That is not possible, unless there was a different index for that set, which would mean a different library prep was made.

written 4 months ago by genomax

Hi genomax

Thanks for replying again so quickly!

Normalising them both together is not an option. The reason I have two samples is that, although they are from the same species, I'm looking for a genomic signature associated with different phenotypes. If BBNorm-ing them together means merging the two datasets, that would defeat the purpose of what I'm trying to achieve: I need a separate assembly for each sample so I can look for differences between them.

I'm even less of an expert on sequencing than I am on assembly (and I'm clearly no expert on assembly), but the two samples were prepared as separate libraries. The quote for the sequencing gives a price per lane, and that's the total price for both samples together, so as I understand it the two samples were prepared as separate libraries and run together in one lane, just with different barcode identifiers.

written 4 months ago by maxwhjohn1988

Ah, you were only referring to the same methods being used for two different samples. It makes sense that they have independent barcodes and can thus be run in one lane. Every library is going to have its own characteristics. Did you generate a k-mer histogram when you ran BBNorm for these samples?

modified 4 months ago • written 4 months ago by genomax

Tagging Brian Bushnell to see if he has any suggestions.

written 4 months ago by genomax

Hi again,

Yes, sorry for not being more clear about that.

No, I didn't get a histogram out of BBNorm, just the normalised fastq files. I'll set that running now.

Thanks again for helping with this; it's good to have experienced minds looking at the problem.
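Generating the histogram without re-running the full normalization can be sketched like this (file names are placeholders; khist.sh is BBNorm's histogram-only wrapper in the BBMap suite, and hist= names the output file, as in BBNorm itself):

```shell
# Write a k-mer depth histogram for the raw reads without normalizing.
# A healthy single-genome library shows one main coverage peak; extra
# peaks can indicate contamination or unusual repeat content.
khist.sh in1=R1.fastq in2=R2.fastq hist=khist.txt
```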

written 4 months ago by maxwhjohn1988

A postdoc colleague has told me that he experienced the same issue with Discovar De Novo. He said that it was specific to particular libraries and that he never managed to resolve it. He ended up using a different assembler. I'm now running this assembly with SPAdes.

If anyone does come up with any ideas on how to fix this issue, though, I'd still love to hear about it.
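For anyone trying the SPAdes route mentioned above, a minimal invocation for a single paired-end library might look like this (file names and resource figures are placeholders; -m caps SPAdes' memory use in GB and -t sets the thread count):

```shell
spades.py -1 reads_R1.fastq -2 reads_R2.fastq -o spades_out -t 28 -m 900
```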

written 3 months ago by maxwhjohn1988
Powered by Biostar version 2.3.0