High coverage bacterial genome reads causing spurious assemblies
5
1
8.1 years ago
b2060780 ▴ 10

Hi,

I'm using SPAdes to assemble a bacterial genome sequenced on Illumina platforms.

I have a few files that I'm struggling to assemble due to the incredibly large coverage of my raw reads. Anything over 250x seems to cause spurious assemblies, with the final fasta being three times the expected size.

At first I thought it could be contamination, but annotation shows only target species genes present. I've heard SPAdes can struggle with particularly high coverage files - so what can I do to get them assembled? One file is ~500x coverage...

Thanks,

Assembly illumina spades • 3.4k views
5
8.0 years ago
Rohit ★ 1.5k

Usually de Bruijn graph assemblers work best at around 60-80x coverage (probably even 100x); beyond that, the problem of spurious contigs appears. As suggested by others, do a normalisation step. bbnorm from the BBMap package is a really good normalisation tool that can get rid of low-coverage regions and normalise highly covered regions down to the expected coverage. It also has a nicely built-in pre-filtering step for sensitivity, and a k-mer size you can choose if required. It can be used as follows:

bbnorm.sh in=input.fastq out=output.fastq target=80 mindepth=10 -Xmx200g threads=28 prefilter=t
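For what it's worth, my reading of those options (hedged, since BBNorm's defaults change between versions): target=80 is the coverage to normalise down to, mindepth=10 is the low-depth cutoff used to drop the low-coverage (likely erroneous) fraction, prefilter=t adds a cheap first counting pass that improves accuracy for a given amount of memory, and -Xmx200g / threads=28 are just the Java heap size and thread count, which you should adjust to your own machine.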

2
8.1 years ago

It's also possible that you have contamination, which will bloat the assembly. SPAdes may generate a somewhat inferior assembly due to high coverage, but 3x the expected size due to 500x coverage would be extremely unusual in my experience - 5% too big would be closer to what I'd expect. It is designed to deal with super-high coverage, after all (though I still find normalization often improves its output). So, please BLAST your assembled contigs against a large database to make sure you are hitting what you expect. You can also analyze the k-mer frequency distribution, or do a contig-length versus coverage plot, a coverage versus GC% plot, or just a GC% plot, to spot probable contamination.
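As a rough sketch of the BLAST check and a length-versus-coverage table (this assumes BLAST+ is installed and that your contigs carry the default SPAdes headers of the form NODE_1_length_12345_cov_45.6; the file names are placeholders):

# Hit the contigs against NCBI nt to see which species they match
# (remote BLAST is slow, so use a local nt database for a whole assembly)
blastn -query contigs.fasta -db nt -remote \
       -outfmt "6 qseqid sseqid pident length evalue stitle" \
       -max_target_seqs 5 -out contigs_vs_nt.tsv

# Pull contig length and k-mer coverage out of the SPAdes FASTA headers
# (>NODE_1_length_12345_cov_45.6) for a length-versus-coverage plot
grep ">" contigs.fasta | awk -F'_' '{print $4 "\t" $6}' > length_vs_cov.tsv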

0

Brian, would you know whether anyone has evaluated whether bbnorm/khmer or other normalization techniques cause mis-assemblies? Titus Brown mentioned a few years ago that khmer may not be the best for highly repetitive genomes (e.g. plant, but some bacteria fall into this category), and I can see that reducing repetitive sequences could remove formerly ambiguous points in the assembly and potentially lead to mis-joins.

0
8.1 years ago

You can either subsample (e.g. with the seqtk sample command, sketched below) or normalize your reads (using digital normalization) and pick whichever yields the better assembly for you. If you search Biostars, many similar questions to yours have already been answered (I just tried "high coverage assembly" as keywords and found some very nice posts about the topic).
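A minimal subsampling sketch with seqtk (the 0.2 fraction and the file names are placeholders; reusing the same seed keeps the mates paired):

# Keep ~20% of the read pairs; the same seed (-s100) for both files
# keeps R1 and R2 in sync
seqtk sample -s100 reads_R1.fastq 0.2 > sub_R1.fastq
seqtk sample -s100 reads_R2.fastq 0.2 > sub_R2.fastq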

There is no need to use extremely high coverage data _just_ because that's what you have.

0
8.0 years ago
MathGon ▴ 10

If your genome contains repeated sequences (CRISPRs, ISs, ...), you will obtain contigs with an excess of coverage. You can estimate the copy number by comparing a contig's coverage with the median sequencing coverage of your larger contigs.

SPAdes produces an assembly graph file (*.fastg). You can open it with Bandage to see the links between your contigs.
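For example (a minimal sketch, assuming Bandage is installed and the graph has the default SPAdes name assembly_graph.fastg):

# Open the SPAdes assembly graph interactively in the Bandage GUI
Bandage load assembly_graph.fastg

# Or render a static picture of the graph from the command line
Bandage image assembly_graph.fastg graph.png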

0
8.0 years ago
Shyam ▴ 150

You can try digital normalization to remove the excess coverage, using a program like khmer (see http://khmer.readthedocs.org/en/v1.1/guide.html). You can also try the Platanus assembler; its manual says it works best with >80x coverage, but also that it needs mate-pair data for a better assembly. You can try it and see if it helps with your data.
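A rough sketch of digital normalization with khmer's normalize-by-median.py (the exact flags differ between khmer releases - this assumes khmer 2.x rather than the v1.1 docs linked above; the memory limit, coverage cutoff and file names are placeholders):

# Interleave the paired reads, then down-sample to a median k-mer coverage of ~20
interleave-reads.py reads_R1.fastq reads_R2.fastq -o interleaved.fastq
normalize-by-median.py -k 20 -C 20 -M 8e9 -p \
    -o normalized.fastq interleaved.fastq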

