I need to assemble a large metagenomics dataset from Illumina NextSeq reads. My read depth is approximately 20 million reads per sample (28 samples), and the concatenated R1 and R2 files are 130 GB each. I'm using 64 threads and it's still not enough.
I've been using metaSPAdes, which has been doing a great job. This is the command I ran:
python /usr/local/packages/spades-3.9.0/bin/metaspades.py -t 64 -m 1000 -1 ./paired_1.fastq -2 ./paired_2.fastq -o . > spades.log
It crashed, and here's the end of the output log:
==> spades.log <==
576G / 944G INFO General (distance_estimation.cpp : 226) Processing library #0
576G / 944G INFO General (distance_estimation.cpp : 132) Weight Filter Done
576G / 944G INFO DistanceEstimator (distance_estimation.hpp : 185) Using SIMPLE distance estimator
<jemalloc>: Error in malloc(): out of memory. Requested: 256, active: 933731762176
It's obviously a memory issue. Has anyone had any success with (1) another assembler, (2) a method to collapse the data beforehand, or (3) data processing that could give unbiased assemblies?
I do not want to assemble in stages because it is difficult to collapse the data into a single dataset.
We thought about randomly selecting R1 and R2 read pairs, but is there another method?
This method seems interesting for unsupervised clustering of the reads beforehand, but I haven't seen any application-based implementations.
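For reference, the random selection we had in mind would look roughly like the sketch below. It is only an illustration: the function name `subsample_pairs` and the `fraction` and `seed` parameters are placeholders of our own, and it assumes uncompressed four-line FASTQ records with R1 and R2 listed in the same order.

```python
import random


def subsample_pairs(r1_path, r2_path, out1_path, out2_path, fraction, seed=0):
    """Keep each read pair with probability `fraction`.

    Assumes plain-text FASTQ with exactly four lines per record and
    mates in the same order in both files. Streams the input, so memory
    use does not grow with file size. Returns the number of pairs kept.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = 0
    with open(r1_path) as r1, open(r2_path) as r2, \
            open(out1_path, "w") as o1, open(out2_path, "w") as o2:
        while True:
            rec1 = [r1.readline() for _ in range(4)]
            rec2 = [r2.readline() for _ in range(4)]
            end1, end2 = not rec1[0], not rec2[0]
            if end1 != end2:
                raise ValueError("R1 and R2 contain different numbers of records")
            if end1:
                break
            # One draw per pair, so both mates are kept or dropped together.
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept
```

Calling `subsample_pairs("paired_1.fastq", "paired_2.fastq", "sub_1.fastq", "sub_2.fastq", fraction=0.25)` would keep roughly a quarter of the pairs (the `sub_*.fastq` names are hypothetical). The concern is that this discards low-abundance organisms in proportion to everything else, which is why we are asking about alternatives.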