Why does a low-coverage library give a more contiguous assembly than a high-coverage library?
16 months ago
Timotheus ▴ 40

Hello,

I'm working with two Illumina whole-genome libraries for the same species; let's call them A (TruSeq Nano) and B (TruSeq PCR-free). Both have 150 bp reads and 350 bp inserts. My pipeline for obtaining draft assemblies:

  1. QC with BBTools
  2. SPAdes assembly
  3. Contaminant contig removal with BlobTools
  4. SPAdes assembly from reads mapping to target contigs
  5. Second round of contaminant contig removal with BlobTools
  6. Redundans on the remaining contigs to obtain haploid genome representation

Assembly A had ~3000 contigs and an N50 of ~100 kb; assembly B had ~8000 contigs and an N50 of 40 kb. Both had a similar total length (~100 Mb). However, when I mapped the QC'd reads against them, A had a read coverage of 24x and B of 120x (after excluding repeat regions).
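
For anyone wanting to recompute the contiguity numbers on their own contigs, here is a minimal sketch of N50 as it is usually defined: the length of the contig at which the running total of lengths, sorted longest first, reaches half of the assembly size. The function names and the plain-text FASTA parsing are illustrative assumptions, not the method any particular assembly-statistics tool uses.

```python
def contig_lengths(fasta_path):
    """Return the sequence length of every record in a plain-text FASTA file."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0
```

Calling `n50(contig_lengths("assembly_A.fasta"))` on each assembly (the file name is a placeholder) gives values that can be compared directly, provided both assemblies went through the same filtering.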

How is it possible that a library with much lower sequencing depth of the target genome gives a more contiguous assembly? Would you be able to suggest strategies to investigate?

Illumina SPAdes assembly N50

Interesting. How were those coverage depths (24 and 120) calculated (mean, median, ...)? And what determined the repeat regions? I wonder if you have more of an apples-to-oranges comparison than you expect, and it just isn't obvious with a couple summary values like these.
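
To make the mean-versus-median point concrete, here is a rough sketch that summarises per-base depth both ways. It assumes the three-column, tab-separated text that `samtools depth` writes (contig, position, depth); the function name and the returned dictionary layout are made up for illustration. It keeps every depth value in memory, so it is only meant for a quick check.

```python
import statistics
from collections import defaultdict


def depth_summary(depth_lines):
    """Summarise per-base depth from lines of 'contig<TAB>position<TAB>depth'."""
    per_contig = defaultdict(list)
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        per_contig[fields[0]].append(int(fields[2]))

    all_depths = [d for depths in per_contig.values() for d in depths]
    if not all_depths:
        raise ValueError("no depth records found")

    return {
        "mean": statistics.mean(all_depths),
        "median": statistics.median(all_depths),
        "per_contig_median": {
            name: statistics.median(depths) for name, depths in per_contig.items()
        },
    }
```

A handful of very deep positions (collapsed repeats, organelle contigs) can pull the mean far above the median, so if 24 and 120 were computed differently, or over different sets of positions, they may not be comparable.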

What I think I'd look into to investigate, roughly in order:

  • How many reads do you have at each step before assembly? What's the quality like?
  • Have you checked what the reads look like (BLAST or something) just to sanity-check that it really is what you expect?
  • Is there an existing reference for the genome you can map each to, rather than doing a de-novo assembly?
  • How do the sets of contigs compare? Do the smaller snippets from B fit into the longer contigs from A, or are they something else entirely?
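
For the first point, a minimal sketch of counting reads and bases at each step, assuming standard four-line FASTQ records with Phred+33 qualities (optionally gzipped). The function names are hypothetical, and the ~100 Mb genome size in the usage note is simply the assembly length quoted in the question, not a measured value.

```python
import gzip


def fastq_stats(path):
    """Return (reads, bases, mean_quality) for a four-line-per-record FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    reads = 0
    bases = 0
    quality_sum = 0
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # sequence line
                reads += 1
                bases += len(line.strip())
            elif index % 4 == 3:  # quality line, Phred+33 assumed
                quality_sum += sum(ord(char) - 33 for char in line.strip())
    mean_quality = quality_sum / bases if bases else 0.0
    return reads, bases, mean_quality


def expected_depth(total_bases, genome_size):
    """Naive expected depth: sequenced bases divided by genome size."""
    return total_bases / genome_size
```

Running `fastq_stats` on the raw, post-QC, and post-decontamination files for A and B, then `expected_depth(bases, 100_000_000)`, would show whether the read counts are even in the right range for 24x and 120x.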

Thanks for your suggestions. Is there a tool for comparing the assemblies at scale? It seems a difficult task with graphical viewers, as the assemblies are highly fragmented, though I could maybe focus on just a few contigs.


Nothing I'm aware of, but I only know some basics in this area and I wouldn't be surprised if there is a tool for just this sort of thing. Personally I'd probably cobble something together with Biopython or even just BLAST... but if there is an existing genome reference you can use here, probably easier to just compare both sets (or just map the reads themselves) to that reference for both A and B. (Are you sure the reads you have in each set really look like that species you're expecting?)
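
As a rough idea of the "cobble something together with BLAST" route: the sketch below assumes the B contigs were BLASTed against the A contigs with the default 12-column tabular output (`-outfmt 6`), and reports what fraction of each B contig is covered by hits. The function name, the `query_lengths` dictionary, and the 95% identity cutoff are all assumptions for illustration.

```python
from collections import defaultdict


def covered_fraction(blast_lines, query_lengths, min_identity=95.0):
    """Fraction of each query contig covered by hits at or above min_identity.

    blast_lines: lines of 12-column BLAST tabular output (outfmt 6).
    query_lengths: dict mapping query contig name to its length in bp.
    """
    intervals = defaultdict(list)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip comments, blank lines, or other formats
        identity = float(fields[2])
        # Columns 7 and 8 are the query start and end (1-based, inclusive).
        start, end = sorted((int(fields[6]), int(fields[7])))
        if identity >= min_identity:
            intervals[fields[0]].append((start, end))

    fractions = {}
    for name, length in query_lengths.items():
        covered = 0
        last_end = 0  # rightmost query position already counted
        for start, end in sorted(intervals.get(name, [])):
            start = max(start, last_end + 1)  # do not count overlaps twice
            if end >= start:
                covered += end - start + 1
                last_end = end
        fractions[name] = covered / length if length else 0.0
    return fractions
```

B contigs with a fraction near 1 sit inside A's assembly; those near 0 are sequence A does not contain at all, which would be worth BLASTing against a public database to see what they are.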
