Why Are Multiple Insert-Size Libraries More Effective In De Novo Assembly?
12.0 years ago
Gingi ▴ 330

It's widely accepted that, pound for pound, using multiple short-read libraries with different insert sizes is more effective than a single insert-size library for generating a _de novo_ assembly from short whole genome shotgun (WGS) reads. Is there a coherent, intuitive explanation for why that is so? Does the effectiveness vary between de Bruijn graph (Eulerian path) methods and overlap-layout-consensus (Hamiltonian path) methods? Is there any published research that discusses this with empirical results (e.g., simulations under varying parameters)?

next-gen sequencing assembly • 6.5k views

Can we use just one library for genome assembly?


Hi buttonwood, your post does not look like an answer to this question; it is another question entirely. You should post it as a separate question, as long as it is not a duplicate.

12.0 years ago
Jts ★ 1.4k

The reason multiple insert libraries are used is to strike a balance between long-range and short-range information. Long-insert mate-pair libraries are great at telling you that two contigs are linked, but they don't tell you much about the sequence in between. Short-insert libraries can help you determine the exact sequence between two contigs, but the information is local.

Consider this analogy. Your friend tells you he is going to drive from Los Angeles to New York. Initially, you don't know the exact cities he will visit in between; there are a huge number of possible routes to take. When he tells you he is going to stop in Chicago, it helps constrain the possible routes. If he tells you he will also stop in Denver and Philadelphia, it helps even more. Each level of information helps reconstruct the whole path taken.

Most assemblers first construct contigs (using either de Bruijn graphs or overlap methods), then have a distinct scaffolding stage that operates on the contigs. In this situation, it doesn't really matter what method was used to build the initial contigs. I don't know of any empirical studies of the impact of insert size choices, but it will depend a lot on the repeat content of the genome.
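To make the scaffolding idea a bit more concrete, here is a rough sketch of how a set of read pairs linking two contigs can be turned into a gap estimate. It is not taken from any particular assembler. The function names, the parameters (`tail_a`, `head_b`) and the assumption that pairs are independent with a known library mean and standard deviation are all mine, for illustration only.

```python
from statistics import mean


def estimate_gap(insert_size, tail_a, head_b):
    """Gap between contig A and contig B implied by one linking read pair.

    insert_size: expected outer distance between the two read ends (library mean)
    tail_a: bases from the outer end of the read on contig A to the end of contig A
    head_b: bases from the start of contig B to the outer end of the read on contig B
    A negative result would suggest the contigs overlap.
    """
    return insert_size - tail_a - head_b


def summarize_link(insert_size, insert_sd, pairs):
    """Average the per-pair gap estimates for one contig-to-contig link.

    pairs: list of (tail_a, head_b) tuples, one per linking read pair.
    Returns (mean_gap, standard_error). The standard error assumes the pairs
    are independent draws from a library with standard deviation insert_sd.
    """
    gaps = [estimate_gap(insert_size, a, b) for a, b in pairs]
    return mean(gaps), insert_sd / len(gaps) ** 0.5


# Hypothetical 3 kb mate-pair library with a 300 bp standard deviation.
links = [(400, 600), (500, 500), (450, 450), (300, 500)]
gap, error = summarize_link(3000, 300, links)
print(f"estimated gap: {gap:.0f} bp, standard error: {error:.0f} bp")
```

Under these assumptions the uncertainty scales with the library's insert-size spread, which is one way to see why a long-insert library tells you that two contigs are linked while a short-insert library pins down local distances much more tightly.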


+1 Nice answer and analogy too!


Thanks, those papers are useful. The SOAPdenovo paper (second link) mentions using step-wise insert sizes "to avoid interleaving," which might be one reason, if not the primary one, for using multiple insert sizes. But they don't explore that any further.

12.0 years ago

It looks like the first simulations of multiple mate pair fragment lengths for WGS were done in this paper: Pairwise end sequencing: a unified approach to genomic mapping and sequencing.

But that was all Sanger stuff with inserts ranging from 1 kb to 40 kb. That doesn't really tell us much about why short-read paired-end insert sizes ranging from, say, 200 bp to 500 bp should help resolve repeats when most retrotransposons and LINE elements are longer than that. Originally the excitement about paired ends was that quality-trimmed or debarcoded Solexa reads were so damn short they could not be uniquely mapped or assembled as singlets.
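The repeat-length concern can be put as a rough rule of thumb: a pair can only bridge a repeat if both reads anchor in unique sequence on either side of it. A minimal sketch of that check follows. The `min_anchor` parameter, the function names and the example lengths are illustrative assumptions, not values from the paper or from any assembler.

```python
def can_bridge(insert_size, repeat_len, min_anchor=30):
    """True if a fragment of insert_size can span a repeat of repeat_len
    while leaving at least min_anchor unique bases under each read."""
    return insert_size >= repeat_len + 2 * min_anchor


def smallest_bridging_library(insert_sizes, repeat_len, min_anchor=30):
    """Smallest insert size that can bridge the repeat, or None if none can."""
    for size in sorted(insert_sizes):
        if can_bridge(size, repeat_len, min_anchor):
            return size
    return None


# Illustrative library insert sizes and repeat lengths, all in bp.
libraries = [200, 500, 3000, 10000]
for repeat_len in (100, 300, 2000, 6000):
    best = smallest_bridging_library(libraries, repeat_len)
    print(f"repeat of {repeat_len} bp: smallest bridging insert = {best}")
```

Under this simple model a 200 to 500 bp library only helps with repeats of a few hundred bases, and anything longer needs a longer-insert library, so a step-wise set of insert sizes covers a wider range of repeat lengths.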

With decent-sized ~76 bp reads, I suspect paired ends help resolve very small repeated motifs within transcription units. Maybe there is some stochastic model of those motifs that might explain why varying fragment lengths would help (if they do). Sometimes I would turn off the paired-end module in Velvet and see N50 drop anywhere from 1% to 5%.

Great question.
