I am trying to assemble Illumina MiSeq paired-end reads from two Plasmodium falciparum samples. The genome is only about 20 Mb. The two samples have quite similar numbers of sequence reads, both with about 900,000 read pairs. I ran SPAdes on a cluster node with 250 GB of memory; one of the two samples assembled successfully, while the other consistently crashes because it runs out of memory. I'm trying to understand why they behave so differently and what I can do about it now. Here is the command line, which is the same for both samples (just different FASTQ files, of course):
spades.py -t 8 -o assembly_sample1b -1 fastq/1-1.fastq -2 fastq/1-2.fastq
For the successful job, the LSF job manager output shows that the maximum memory used was 33,927 MB. The job that keeps failing shows that 378,014 MB was required, i.e. about 128 GB more than the node has available. Consequently, LSF terminated the job, and the same thing happens every time I rerun it.
How can the memory requirements for two similarly sized sets of reads from the same genome be so different? Can I reduce the memory requirements somehow? Speed isn't of the essence for me. Thanks!
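In case it's relevant, this is what I was planning to try next: rerunning with SPAdes' -m option to cap memory (in GB, assuming it acts as a hard limit as the SPAdes manual describes) and with fewer threads, on the theory that the error-correction stage uses some memory per thread. The output directory and the sample-2 FASTQ names below are just placeholders:

spades.py -t 4 -m 250 -o assembly_sample2_capped -1 fastq/2-1.fastq -2 fastq/2-2.fastq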
CORRECTION The numbers of reads are actually 5.2 million for sample 1 (the one that works) and 5.6 million for sample 2. So it is a bit more than I thought, but the counts are still quite similar, which makes me wonder why one requires so much more memory than the other. Thanks!
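For reference, I counted the read pairs simply by dividing the number of lines in each R1 file by four (one FASTQ record is four lines); this assumes the files are uncompressed, which mine are:

echo $(( $(wc -l < fastq/1-1.fastq) / 4 ))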