Question: High coverage bacterial genome reads causing spurious assemblies
3.5 years ago by
b2060780 (10) wrote:


I'm using SPAdes to assemble bacterial genomes sequenced on Illumina platforms.

I have a few files that I'm struggling to assemble because the coverage of my raw reads is extremely high. Anything over 250x seems to cause spurious assemblies, with the final FASTA being three times the expected size.

At first I thought it could be contamination, but annotation shows only target species genes present. I've heard SPAdes can struggle with particularly high coverage files - so what can I do to get them assembled? One file is ~500x coverage...


illumina spades assembly • 1.7k views
3.4 years ago by
Rohit (1.4k) wrote:

De Bruijn graph assemblers usually work best at around 60-80x coverage (perhaps up to 100x); beyond that, the problem of spurious contigs appears. As others have suggested, add a normalisation step. BBNorm, from the BBMap package, is a really good normalisation tool that can get rid of low-coverage regions and normalise highly covered regions down to the expected coverage. It also has a built-in prefiltering step for sensitivity, and you can choose the k-mer length if required. It can be used as follows: bbnorm.sh in=input.fastq out=output.fastq target=80 mindepth=10 -Xmx200g threads=28 prefilter=t
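To make the idea of normalisation concrete, here is a toy Python sketch of digital normalisation: a read is kept only while the median count of its k-mers is still below a target depth. This is not how BBNorm or khmer are implemented (they use probabilistic counters and handle reverse complements and error k-mers); the function names and the tiny `k` and `target` values are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def normalize_reads(reads, k=4, target=3):
    """Toy digital normalisation (illustrative only).

    A read is kept if the median count of its k-mers, among the reads
    kept so far, is below `target`. Kept reads update the counts.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: nothing to count
        if median(counts[km] for km in read_kmers) < target:
            kept.append(read)
            counts.update(read_kmers)
    return kept


if __name__ == "__main__":
    # Ten copies of one read (very high coverage) plus one unique read.
    reads = ["ACGTTGCA"] * 10 + ["GGATCCAT"]
    print(normalize_reads(reads, k=4, target=3))
```

With real data you would use a dedicated tool; the sketch only shows why redundant reads from over-covered regions get dropped while reads from regions below the target are kept.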

3.5 years ago by
Walnut Creek, USA
Brian Bushnell (16k) wrote:

It's also possible that you have contamination, which will bloat the assembly. SPAdes may generate a somewhat inferior assembly due to high coverage, but 3x the expected size due to 500x coverage would be extremely unusual in my experience; 5% too big would be more what I'd expect. It is designed to deal with super-high coverage, after all (though I still find normalization often improves its output). So, please BLAST your assembled contigs against a large database to make sure you are hitting what you expect. You can also analyze the k-mer frequency distribution, or do a contig-length versus coverage plot, a coverage versus GC% plot, or just a GC% plot, to spot probable contamination.
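As a rough sketch of how the values for such plots could be collected, the Python below parses a contigs FASTA and returns length, coverage, and GC% per contig. It assumes SPAdes-style contig names of the form NODE_1_length_5000_cov_250.5 (where the cov value is k-mer coverage rather than read coverage); `read_fasta` and `contig_stats` are hypothetical helper names, not part of any tool.

```python
import re

# Assumed SPAdes-style header: NODE_<id>_length_<len>_cov_<kmer coverage>
HEADER = re.compile(r"length_(\d+)_cov_([\d.]+)")


def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def contig_stats(text):
    """Return a list of (name, length, coverage, gc_percent) tuples.

    Coverage is taken from the header and is None if the header does
    not match the assumed naming scheme.
    """
    rows = []
    for name, seq in read_fasta(text):
        match = HEADER.search(name)
        coverage = float(match.group(2)) if match else None
        seq = seq.upper()
        gc_count = seq.count("G") + seq.count("C")
        gc_percent = 100.0 * gc_count / len(seq) if seq else 0.0
        rows.append((name, len(seq), coverage, gc_percent))
    return rows


if __name__ == "__main__":
    example = ">NODE_1_length_8_cov_250.5\nACGT\nACGT\n>NODE_2_length_4_cov_3.0\nGGCC\n"
    for row in contig_stats(example):
        print(row)
```

Contigs that sit far from the main cloud in coverage or GC% are the ones worth checking with BLAST first.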


Brian, do you know whether anyone has evaluated whether bbnorm, khmer, or other normalization techniques cause mis-assemblies? Titus Brown mentioned a few years ago that khmer may not be the best choice for highly repetitive genomes (e.g. plants, but some bacteria fall into this category), and I can see that reducing repetitive sequences would remove formerly ambiguous points in the assembly and potentially lead to mis-joins.

Reply written 19 months ago by Chris Fields (2.1k)
3.5 years ago by
State College, PA, USA
Biomonika (Noolean) (3.1k) wrote:

You can either subsample your reads (e.g. with the seqtk sample command) or normalize them (using digital normalization), and pick whichever yields better assemblies for you. If you search Biostars, many questions similar to yours have already been answered (I just tried "high coverage assembly" as keywords and found some very nice posts about the topic).

There is no need to use extremely high coverage data _just_ because that's what you have.
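For subsampling, the fraction of reads to keep follows from simple arithmetic: coverage is total sequenced bases divided by genome size, and the fraction is target coverage divided by current coverage. A minimal Python sketch, with hypothetical function names and an assumed 5 Mb genome in the example:

```python
def estimate_coverage(total_bases, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def subsample_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep to go from current to target coverage."""
    if current_coverage <= target_coverage:
        return 1.0  # already at or below target: keep everything
    return target_coverage / current_coverage


if __name__ == "__main__":
    # Assumed example: 2.5 Gb of reads for a 5 Mb genome.
    coverage = estimate_coverage(2_500_000_000, 5_000_000)
    print(coverage, subsample_fraction(coverage, 80))
```

The resulting fraction is the kind of value that can be given to seqtk sample; for paired-end data, use the same random seed for both files so that mates stay in sync.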

3.5 years ago by
MathGon (10) wrote:

If your genome contains repeated sequences (CRISPRs, insertion sequences, etc.), you will obtain contigs with an excess of coverage. You can estimate the number of copies by comparing each contig's coverage with the median sequencing coverage of your larger contigs.
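That comparison can be sketched in a few lines of Python. The `min_length` cutoff for what counts as a "larger" contig and the function name are assumptions made for illustration; the input is a list of (name, length, coverage) tuples.

```python
from statistics import median


def estimate_copy_numbers(contigs, min_length=10000):
    """Estimate copy number per contig as coverage / baseline coverage.

    The baseline is the median coverage of contigs at least
    `min_length` long, which are assumed to be single-copy.
    """
    long_cov = [cov for _, length, cov in contigs if length >= min_length]
    if not long_cov:
        raise ValueError("no contigs at or above min_length")
    baseline = median(long_cov)
    return {name: cov / baseline for name, _, cov in contigs}


if __name__ == "__main__":
    contigs = [
        ("contig_a", 50000, 100.0),
        ("contig_b", 40000, 110.0),
        ("contig_c", 30000, 90.0),
        ("repeat_x", 1200, 500.0),  # short, over-covered contig
    ]
    print(estimate_copy_numbers(contigs))
```

A short contig with a ratio close to an integer greater than one is a reasonable candidate for a collapsed repeat.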

SPAdes produces a graph file (*.fastg*). You can open it with Bandage to show the links between your contigs.

3.4 years ago by
United States
Shyam (130) wrote:

You can try digital normalization to remove excess coverage, using programs like khmer. You can also try the Platanus assembler. The manual says it works best with >80x coverage, but it also says it needs mate-pair data for a better assembly. You can try it and see if it helps with your data.


