Question: How to assemble complex soil metagenome datasets?
Lina F (Boston, MA) wrote, 3.5 years ago:

Hi all,

I have 27 soil WGS metagenome datasets and I am trying to assemble them into contigs that are at least 1000-2000 bp long. Each dataset on its own is 20-30 GB of paired-end FASTQ files.

I first tried the Ray Meta assembler because it's supposed to run well in parallel. That worked for most datasets, but the contigs I got are very short (most are <500 bp). Then I found this paper that suggests it does better for low-complexity datasets.

I also took a look at Concoct and the strategy sounds sensible, but the code on their GitHub page looks woefully outdated and I'm not sure how much of it is still maintained. It also suggests combining all datasets into one, assembling that (a "coassembly"), and using the result for the downstream analysis, but that approach will be computationally challenging since my datasets are so large.

If anyone has experience with assembling complex soil datasets, I'd love to do some brainstorming, so please reach out!


soil metagenomics assembly • 3.6k views
Brian Bushnell (Walnut Creek, USA) wrote, 3.5 years ago:

The best approach for contiguity is generally to coassemble if you have sufficient resources. Some assemblers (HipMer/MetaHipMer, Ray, and Omega/Disco) can distribute that to spread the memory use across multiple nodes... since you've tried Ray, you might try the others too and see if they give better results. For a single node, we've found Megahit gives the best results with the lowest resource consumption.
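For the single-node Megahit route, a coassembly amounts to passing every sample's forward and reverse files to one run as comma-separated lists. Here is a minimal sketch that only builds the command line; the helper name, the file names, and the defaults (16 threads, a 1000 bp contig cutoff, the `meta-large` preset) are illustrative assumptions rather than settings recommended in this thread.

```python
from typing import List, Sequence, Tuple


def build_megahit_coassembly_cmd(
    pairs: Sequence[Tuple[str, str]],
    out_dir: str,
    threads: int = 16,
    min_contig_len: int = 1000,
) -> List[str]:
    """Build an argv list for ONE Megahit run over all paired-end samples.

    `pairs` holds (forward_fastq, reverse_fastq) tuples, one per sample.
    The function name and default values are hypothetical.
    """
    if not pairs:
        raise ValueError("need at least one (forward, reverse) pair")
    # Megahit accepts comma-separated file lists; order must match between -1 and -2.
    forward = ",".join(fwd for fwd, _ in pairs)
    reverse = ",".join(rev for _, rev in pairs)
    return [
        "megahit",
        "-1", forward,
        "-2", reverse,
        "--presets", "meta-large",  # preset aimed at large, complex metagenomes
        "--min-contig-len", str(min_contig_len),
        "-t", str(threads),
        "-o", out_dir,  # Megahit expects this directory not to exist yet
    ]


if __name__ == "__main__":
    cmd = build_megahit_coassembly_cmd(
        [("s1_R1.fq.gz", "s1_R2.fq.gz"), ("s2_R1.fq.gz", "s2_R2.fq.gz")],
        out_dir="coassembly_out",
    )
    print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files. Building the list separately makes it easy to log exactly which samples went into each coassembly.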

You can also try approaches such as binning (using e.g. Metabat) and then assembling just the reads that map to each individual bin. Normalization, error-correction, and/or discarding low-depth reads can also improve assemblies. With Ray and Disco, both error-correction and merging paired reads prior to assembly increase contiguity.

But in general it's not a solved problem, so you'll have to experiment a lot! Don't expect great contiguity, though; complex metagenomes often yield an L50 (length) of 200 bp or less.

Note that you may be able to bin the raw reads using a binning tool based on depth covariance if the 27 samples are different (different conditions, location, time, etc).
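To make "depth covariance" concrete: sequences from the same genome should rise and fall in depth together across the 27 samples, so their per-sample depth profiles correlate strongly. The toy sketch below computes a Pearson correlation between two such profiles. It is not how any particular binning tool is implemented, and the function name and the depth values are made up for illustration.

```python
from math import sqrt
from typing import Sequence


def depth_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two per-sample depth profiles.

    Each profile has one depth value per sample, in the same sample order.
    Returns 0.0 if either profile is constant (correlation is undefined there).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical depths across four samples.
    seq_a = [5.0, 20.0, 2.0, 11.0]
    seq_b = [10.0, 41.0, 4.0, 23.0]   # tracks seq_a: likely the same organism
    seq_c = [30.0, 3.0, 25.0, 8.0]    # moves the opposite way: likely a different one
    print(depth_correlation(seq_a, seq_b))
    print(depth_correlation(seq_a, seq_c))
```

This only works if the samples really differ; if all 27 have near-identical community composition, every profile is close to flat and the correlations carry little signal.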


Thanks for the feedback! I tried MegaHit and reduced assembly time for a single sample from 11 hours (for Ray) to 2 hours. This was on a large AWS EC2 instance.

I also managed to do a coassembly for 7 samples in 19 hours using Megahit, so this is very encouraging!

Reply written 3.5 years ago by Lina F

Yep, Megahit is a great tool, and I highly recommend it.

I'd be remiss not to mention, though, that speed is not the only metric you should be considering. At a minimum, please also check the basic assembly stats (N50, L50). For example, I can guarantee you that BBMap's Tadpole is faster than Megahit or Ray, but that does not mean it's better: it has fewer misassemblies, but the contiguity is lower. The choice of assembler is dictated by your goals.
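If you want to sanity-check those stats yourself, here is a minimal sketch that computes them from a list of contig lengths. It assumes the common convention where N50 is a length and L50 is a count of contigs; some tools swap the two names (the answer above uses L50 as a length), so check which convention your tool reports. The function name is just illustrative.

```python
from typing import Iterable, Tuple


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a set of contig lengths.

    Convention assumed here: N50 is the length of the contig at which the
    running total (longest first) reaches half of the assembly size, and
    L50 is how many contigs it took to get there.
    """
    ordered = sorted((n for n in lengths if n > 0), reverse=True)
    if not ordered:
        raise ValueError("no contigs with positive length")
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise AssertionError("unreachable: running total always reaches half")


if __name__ == "__main__":
    # Total is 1500 bp; 500 + 400 = 900 passes the 750 bp halfway mark.
    print(n50_l50([100, 200, 300, 400, 500]))  # (400, 2)
```

Comparing these numbers across assemblers only makes sense with the same minimum contig length cutoff, since dropping short contigs shrinks the total and inflates N50.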

In general, I'd choose the assembler that gives the best results for your goal, rather than the fastest one. Sometimes that means the best contiguity (in which case I'd suggest SPAdes), sometimes that means the fewest misassemblies (in which case I'd suggest Tadpole), and sometimes that means the best balance of contiguity, accuracy, time, and resource usage (in which case I'd suggest Megahit).

Reply written 3.5 years ago by Brian Bushnell (modified 3.5 years ago)
Joe (United Kingdom) wrote, 3.5 years ago:

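My suggestion would also have been CONCOCT. It's developed by Chris Quince in my department. It is still under active development, but the docs etc. are a bit out of date, as you say.

Strictly speaking it bins contigs (by coverage and composition) rather than assembling reads, so you'd run it on the output of an assembler. It's specifically designed for metagenomes though, so I'd give it a try.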


Thanks for the feedback! I was able to do a coassembly of 7 samples (the first part of my dataset) using Megahit, and now running Concoct should be within reach :)

It's also good to hear that it's still under active development! I've been looking at the GitHub repository. Is that a good place to keep an eye on for new developments?

Reply written 3.5 years ago by Lina F