Question: Problem with memory requirements for SPAdes
Asked 5 weeks ago by Tospohh10:

I am trying to assemble Illumina MiSeq paired-end reads from two Plasmodium falciparum samples. The genome is only about 20 Mb long. The two samples are quite similar in the number of sequence reads, with both having about 900,000 read pairs. I ran SPAdes on a machine with 250 GB of memory; one of the two samples was assembled successfully, while the other consistently crashes because it runs out of memory. I'm trying to understand why they behave so differently and what I can do about it. Here is the command line, which is the same for both samples apart from the FASTQ files:

spades.py -t 8 -o assembly_sample1b -1 fastq/1-1.fastq -2 fastq/1-2.fastq

I'm running this on a cluster node with 250 GB of memory. For the sample that works, the LSF job manager output shows that the maximum memory used was 33927 MB.

The job that keeps failing shows that 378014 MB was required, i.e. about 128 GB more than the node had available. Consequently, LSF terminated the job, and the same thing happens every time I rerun it.

How can the memory requirements for two similar sized samples of reads from the same genome be so different? Can I reduce the memory requirements somehow? Speed isn't of the essence for me. Thanks!

CORRECTION: The numbers of reads are actually 5.2 million for sample 1 (the one that works) and 5.6 million for sample 2. That is more than I thought, but the numbers are still quite similar, which makes me wonder why one requires so much more memory than the other. Thanks!

spades assembly • 122 views
modified 5 weeks ago • written 5 weeks ago by Tospohh10

Have you considered normalizing the data before doing assemblies? You may have an overabundance of coverage here.

written 5 weeks ago by genomax78k

No, I wasn't aware of that. I haven't done a lot of assemblies and the last one was years ago. So from reading the manual you are referring to, I gather that I should aim for about 100x coverage. Is that a general recommendation or does it depend on the genome in question and the assembler?

written 5 weeks ago by Tospohh10
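
As a rough illustration of the 100x figure mentioned above, average coverage can be estimated as the total number of sequenced bases divided by the genome size. The sketch below is not from the thread. The 250 bp read length is an assumption (the thread does not state it), the counts from the correction are treated as read pairs, and the function names are made up for illustration.

```python
def estimate_coverage(read_pairs: int, read_length: int, genome_size: int) -> float:
    """Expected average depth: total sequenced bases / genome size."""
    total_bases = 2 * read_pairs * read_length  # two reads per pair
    return total_bases / genome_size


def fraction_for_target(current_cov: float, target_cov: float) -> float:
    """Fraction of read pairs to keep to reach target_cov (never above 1)."""
    if current_cov <= 0:
        raise ValueError("current coverage must be positive")
    return min(1.0, target_cov / current_cov)


if __name__ == "__main__":
    GENOME_SIZE = 20_000_000  # "about 20Mb", as stated in the question
    READ_LENGTH = 250         # ASSUMPTION: not given in the thread
    for name, pairs in (("sample 1", 5_200_000), ("sample 2", 5_600_000)):
        cov = estimate_coverage(pairs, READ_LENGTH, GENOME_SIZE)
        keep = fraction_for_target(cov, 100)
        print(f"{name}: ~{cov:.0f}x, keep {keep:.2f} of pairs for ~100x")
```

Under these assumptions the two samples come out at about 130x and 140x. This is only an average: it says nothing about how evenly the reads are spread across the genome.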

It is a general recommendation, but every genome is going to be different, so you will need to experiment some.

written 5 weeks ago by genomax78k
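
One simple way to experiment with lower coverage is to keep a random fraction of the read pairs. The sketch below is not from the thread, and it is plain random subsampling, not the k-mer-based normalization that dedicated tools perform. It assumes uncompressed FASTQ files with exactly four lines per record and mates in the same order in both files; the function name and its parameters are hypothetical.

```python
import random


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=0):
    """Keep each read pair with probability `fraction`; return (seen, kept)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    seen = kept = 0
    with open(r1_in) as f1, open(r2_in) as f2, \
            open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        while True:
            # Read one 4-line FASTQ record from each mate file.
            rec1 = [f1.readline() for _ in range(4)]
            rec2 = [f2.readline() for _ in range(4)]
            if not rec1[0] and not rec2[0]:
                break  # both files ended together
            if not rec1[3] or not rec2[3]:
                raise ValueError("truncated record or unequal file lengths")
            seen += 1
            # One random draw per pair keeps both mates in sync.
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return seen, kept
```

Because a single random draw decides the fate of both mates, the two output files stay paired, which paired-end assemblers require.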

You can set the memory limit manually with the -m option; the default is 250 GB. Pay attention to the tmp directory as well. What is the maximum k-mer size used in your assembly?

modified 5 weeks ago • written 5 weeks ago by cpad011212k

I tried various settings for -m, but whatever I use, it always tries to use more than is available. And of course the above command was run on a machine that did have 250 GB of RAM, but it ended up requesting almost 380 GB despite -m being 250 by default. Max k was (automatically) set to 77.

written 5 weeks ago by Tospohh10

See if you can run KmerGenie on your raw data to find the optimum k-mer size. If it is below 77, you might want to reduce the maximum k-mer size so that the memory requirements are lower. Usually, the higher the k-mer size, the more RAM is required. If SPAdes keeps failing, you can use MEGAHIT as an alternative. In my experience, MEGAHIT is less RAM-hungry than SPAdes.

modified 5 weeks ago • written 5 weeks ago by cpad011212k
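
To cap the k-mer size as suggested above, the k values can be passed to spades.py explicitly with -k instead of letting it pick them automatically. The sketch below is not from the thread; it only builds the command line and does not run it. The k list (21, 33, 55), the input paths, and the output directory name are illustrative assumptions, and the helper function is hypothetical. The checks reflect the SPAdes requirement that k values be odd, below 128, and listed in ascending order.

```python
import shlex


def spades_command(r1, r2, outdir, kmers, threads=8, memory_gb=250):
    """Return the argv list for a paired-end spades.py run with explicit k values."""
    kmers = list(kmers)
    if not kmers:
        raise ValueError("at least one k-mer size is required")
    for k in kmers:
        if k % 2 == 0 or k >= 128:
            raise ValueError(f"k={k}: SPAdes expects odd values below 128")
    if kmers != sorted(kmers):
        raise ValueError("k-mer sizes should be listed in ascending order")
    return [
        "spades.py",
        "-t", str(threads),
        "-m", str(memory_gb),  # memory limit in GB
        "-k", ",".join(map(str, kmers)),
        "-o", outdir,
        "-1", r1,
        "-2", r2,
    ]


if __name__ == "__main__":
    # Hypothetical file and directory names, for illustration only.
    cmd = spades_command("fastq/2-1.fastq", "fastq/2-2.fastq",
                         "assembly_sample2_k55", kmers=[21, 33, 55])
    print(shlex.join(cmd))
```

Building the command as a list also makes it easy to hand to subprocess.run later without shell quoting problems.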

Thanks! Wasn't aware of that tool either, will try it out!

written 5 weeks ago by Tospohh10

By the way, the '-m' option is apparently only a precaution. Please follow this official thread on the out-of-memory issue: https://github.com/ablab/spades/issues/19

written 5 weeks ago by cpad011212k