Question: Why Are Multiple Insert-Size Libraries More Effective In De Novo Assembly?
8.6 years ago by
Irvington, NY
Gingi330 wrote:

It's widely accepted that, pound for pound, using multiple short-read libraries with different insert sizes is more effective than using a single-insert-size library for _de novo_ assembly of short whole-genome shotgun (WGS) reads. Is there a coherent, intuitive explanation for why that is so? Does the effectiveness vary between de Bruijn graph (Eulerian path) methods and overlap-layout-consensus (Hamiltonian path) methods? Is there any published research that discusses this with empirical results (e.g., simulations under varying parameters)?

assembly next-gen sequencing • 5.1k views
modified 5.9 years ago by Buttonwood40 • written 8.6 years ago by Gingi330
8.6 years ago by
Jts1.2k wrote:

Multiple insert libraries are used to strike a balance between long- and short-range information. Long-insert mate-pair libraries are great at telling you that two contigs are linked, but they don't tell you much about the sequence in between. Short-insert libraries can help you determine the exact sequence between two contigs, but the information is local.

Consider this analogy. Your friend tells you he is going to drive from Los Angeles to New York. Initially, you don't know the exact cities he will visit in between - there are a huge number of possible routes to take. When he tells you he is going to stop in Chicago, it helps constrain the possible routes. If he tells you he will also stop in Denver and Philadelphia, it helps even more. Each level of information helps reconstruct the whole path taken.
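
To make the analogy concrete, here is a minimal sketch that counts routes through a small branching graph, first with no constraints and then with a growing list of known stops. The graph, the node names, and the helper functions are all hypothetical and chosen only for illustration; a real assembly graph has cycles (repeats), so this simple counting, which assumes a directed acyclic graph, would not carry over directly.

```python
from functools import lru_cache

# Hypothetical toy graph (a DAG): every level offers two alternatives,
# much like branches in an assembly graph or alternative roads on a trip.
GRAPH = {
    "S": ("A", "B"),
    "A": ("C", "D"),
    "B": ("C", "D"),
    "C": ("E", "F"),
    "D": ("E", "F"),
    "E": ("T",),
    "F": ("T",),
    "T": (),
}


@lru_cache(maxsize=None)
def count_paths(src, dst):
    """Number of distinct directed paths from src to dst in GRAPH."""
    if src == dst:
        return 1
    return sum(count_paths(nxt, dst) for nxt in GRAPH[src])


def count_constrained(waypoints):
    """Number of paths that visit the given waypoints in order.

    In a DAG this is the product of the path counts between
    consecutive waypoints.
    """
    total = 1
    for a, b in zip(waypoints, waypoints[1:]):
        total *= count_paths(a, b)
    return total


for stops in (["S", "T"], ["S", "C", "T"], ["S", "C", "E", "T"]):
    print(" -> ".join(stops), ":", count_constrained(stops), "possible routes")
```

Each extra known stop shrinks the set of candidate routes, which is the role that pairs from different insert sizes play at different distance scales.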

Most assemblers first construct contigs (using either de Bruijn graphs or overlap methods) and then have a distinct scaffolding stage that operates on the contigs. In this situation, it doesn't really matter what method was used to build the initial contigs. I don't know of any empirical studies of the impact of insert size choices, but it will depend a lot on the repeat content of the genome.
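
As a rough illustration of what a scaffolding stage can do with read pairs, the sketch below estimates the gap between two contigs from pairs that link them. It is not how any particular assembler does it. It assumes that both contigs are in the same orientation, that "insert size" means the outer fragment length, and that each link's error is an independent draw from the library's insert-size distribution. The function name, the coordinates, and the numbers are made up for the example.

```python
from math import sqrt
from statistics import mean


def estimate_gap(links, len_a, insert_mean, insert_sd):
    """Estimate the gap between contig A (left) and contig B (right).

    links: list of (pos_a, end_b) tuples, where pos_a is the start of the
           forward read on contig A and end_b is the end of its mate on
           contig B (0-based coordinates on each contig).
    Returns (gap_estimate, standard_error).
    """
    # fragment = (bases of A covered) + gap + (bases of B covered)
    # so gap = fragment - (len_a - pos_a) - end_b, with fragment ~ insert_mean
    per_link = [insert_mean - (len_a - pos_a) - end_b for pos_a, end_b in links]
    # Averaging n independent links shrinks the error roughly as sd / sqrt(n).
    return mean(per_link), insert_sd / sqrt(len(per_link))


# Hypothetical short-insert library: mean 500 bp, standard deviation 50 bp.
links = [(9800, 150), (9750, 120), (9900, 260), (9850, 190)]
gap, se = estimate_gap(links, len_a=10_000, insert_mean=500, insert_sd=50)
print(f"estimated gap: {gap} bp (standard error about {se:.0f} bp)")
```

Under these assumptions, a library with a tight insert-size distribution pins down short gaps precisely, while a long-insert library with a wide distribution can link distant contigs but gives a much looser estimate of the distance between them.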


+1 Nice answer and analogy too!!!

written 8.6 years ago by Rm7.9k
8.6 years ago by
Danville, PA
Rm7.9k wrote:

The articles below shed some light on insert sizes and assembly efficiency:

De novo assembly of short sequence reads

De novo assembly of human genomes with massively parallel short read sequencing

A new strategy for genome assembly using short sequence reads and reduced representation libraries


Thanks, those papers are useful. The SOAPdenovo paper (second link) mentions using step-wise insert sizes "to avoid interleaving," which might be one — if not the primary — reason for using multiple insert sizes. But they don't explore that any further.

written 8.6 years ago by Gingi330
8.6 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

It looks like the first simulations of multiple mate pair fragment lengths for WGS were done in this paper: Pairwise end sequencing: a unified approach to genomic mapping and sequencing.

But that was all Sanger stuff with inserts ranging from 1kb to 40kb. That doesn't really tell us much about why short read paired end insert sizes ranging from say 200bp to 500bp should help resolve repeats when most retrotransposons and LINE elements are longer than that. Originally the excitement about paired ends was that quality trimmed or debarcoded Solexa reads were so damn short they could not be uniquely mapped/assembled as singlets.

With decent-sized ~76bp reads, I suspect paired ends help resolve very small repeated motifs within transcription units. Maybe there is some stochastic model of those motifs that might explain why varying fragment lengths would help (if they do). Sometimes I would turn off the paired-end module in Velvet and see N50 drop anywhere from 1% to 5%.
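
A back-of-the-envelope way to see which repeats a given library can bridge is sketched below. It assumes that "insert size" is the outer fragment length and that a pair bridges a repeat only when both reads lie entirely in the unique sequence on either side, so the longest bridgeable repeat is roughly the fragment length minus two read lengths. The library sizes and repeat lengths are hypothetical, and real mappers and assemblers can anchor on partial overlaps, so treat the threshold as approximate.

```python
READ_LEN = 76  # read length mentioned above


def max_bridgeable_repeat(fragment_len, read_len=READ_LEN):
    """Longest repeat a pair can jump with both reads fully in unique flanks."""
    return max(0, fragment_len - 2 * read_len)


def bridging_libraries(repeat_len, fragment_lens, read_len=READ_LEN):
    """Fragment lengths whose pairs can bridge a repeat of the given length."""
    return [f for f in fragment_lens
            if max_bridgeable_repeat(f, read_len) >= repeat_len]


# Hypothetical libraries: two paired-end sizes and two mate-pair sizes (bp).
libraries = [200, 500, 2000, 10_000]
for repeat_len in (40, 300, 1500, 6000):
    print(f"{repeat_len:>5} bp repeat bridged by:",
          bridging_libraries(repeat_len, libraries))
```

With 76 bp reads, the 200 bp and 500 bp libraries only reach repeats of up to about 48 bp and 348 bp under this model, which fits the suspicion that short-insert pairs mostly resolve small motifs; something like a full-length LINE of roughly 6 kb needs a much longer library.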

Great question.

5.9 years ago by
Buttonwood40 wrote:

Can we use just one library for genome assembly?


Hi Buttonwood, your post does not look like an answer to this question but is another question entirely. You should try posting it as a new question, as long as it does not appear to be a duplicate.

written 5.9 years ago by cts1.6k