It's widely accepted that, pound for pound, using multiple short-read libraries with different insert sizes is more effective than a single-insert-size library for generating a _de novo_ assembly from short whole genome shotgun (WGS) reads. Is there a coherent, intuitive explanation for why that is so? Does the effectiveness vary between de Bruijn graph (Eulerian path) methods and overlap-layout-consensus (Hamiltonian path) methods? Is there any published research that discusses this with empirical results (e.g., simulations under varying parameters)?
The reason multiple insert libraries are used is to strike a balance between long- and short-range information. Long-insert mate pair libraries are great at telling you that two contigs are linked, but they don't tell you much about the sequence in between. Short-insert libraries can help you determine the exact sequence between two contigs, but the information is local.
Consider this analogy. Your friend tells you he is going to drive from Los Angeles to New York. Initially, you don't know the exact cities he will visit in between; there are a huge number of possible routes to take. When he tells you he is going to stop in Chicago, it helps constrain the possible routes. If he tells you he will also stop in Denver and Philadelphia, it helps even more. Each level of information helps reconstruct the whole path taken.
Most assemblers first construct contigs (using either de Bruijn graphs or overlap methods) and then have a distinct scaffolding stage that operates on the contigs. In this situation, it doesn't really matter what method was used to build the initial contigs. I don't know of any empirical studies of the impact of insert size choices, but it will depend a lot on the repeat content of the genome.
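To make the scaffolding stage a bit more concrete, here is a minimal sketch of the linking step, assuming the read pairs have already been mapped back to contigs. The function name `contig_links`, the `min_support` threshold, and the toy contig names are all made up for illustration and are not taken from any particular assembler. It only counts pairs whose two reads land on different contigs and keeps links seen often enough; real scaffolders also use orientation and the insert size distribution.

```python
from collections import Counter


def contig_links(pair_hits, min_support=2):
    """Count read pairs that connect two different contigs.

    pair_hits: iterable of (contig_of_read1, contig_of_read2) tuples.
    min_support: hypothetical threshold; links seen fewer times are dropped.
    Returns a dict mapping (contig_a, contig_b) -> number of supporting pairs.
    """
    counts = Counter()
    for a, b in pair_hits:
        if a != b:  # pairs inside a single contig carry no linking information
            counts[tuple(sorted((a, b)))] += 1
    return {link: n for link, n in counts.items() if n >= min_support}


# Toy input: three pairs join c1 and c2, one pair joins c2 and c3.
pairs = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3"), ("c1", "c2")]
links = contig_links(pairs)
print(links)
```

Note that nothing here depends on how the contigs were built, which is the point made above: the pairs are only used once contigs exist.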
The articles below throw some light on insert sizes and assembly efficiencies...
It looks like the first simulations of multiple mate pair fragment lengths for WGS were done in this paper: Pairwise end sequencing: a unified approach to genomic mapping and sequencing.
But that was all Sanger stuff with inserts ranging from 1kb to 40kb. That doesn't really tell us much about why short-read paired-end insert sizes ranging from, say, 200bp to 500bp should help resolve repeats when most retrotransposons, LINEs included, are longer than that. Originally the excitement about paired ends was that quality-trimmed or debarcoded Solexa reads were so damn short they could not be uniquely mapped/assembled as singlets.
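The size argument can be written down as a back-of-the-envelope check. This is only a sketch under a simplifying assumption: a pair bridges a repeat if the fragment (insert size taken here as the outer distance between the two read ends) covers the repeat plus some unique anchor sequence on each side. The function names, the `min_anchor` value of 30bp, and the library and repeat sizes below are illustrative placeholders, not measurements.

```python
def can_bridge(insert_size, repeat_len, min_anchor):
    """True if a fragment of insert_size can cover the repeat and still
    leave min_anchor bases of unique sequence on both flanks."""
    return insert_size >= repeat_len + 2 * min_anchor


def bridging_libraries(repeat_len, libraries, min_anchor=30):
    """Names of the libraries whose insert size can bridge the repeat."""
    return [name for name, size in libraries.items()
            if can_bridge(size, repeat_len, min_anchor)]


# Hypothetical library mix (insert sizes in bp).
libraries = {"PE 200bp": 200, "PE 500bp": 500, "MP 3kb": 3000, "MP 10kb": 10000}

# Illustrative repeat lengths in bp: a short motif, a roughly Alu-sized
# element, and a roughly full-length LINE-sized element.
for label, length in [("short motif", 150), ("SINE-sized", 300), ("LINE-sized", 6000)]:
    print(label, bridging_libraries(length, libraries))
```

Under these assumptions the 200bp and 500bp libraries only bridge the small repeats, and it takes a multi-kb library to get across a LINE-sized element.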
With decent-sized ~76bp reads I suspect paired ends help resolve very small repeated motifs within transcription units. Maybe there is some stochastic model of those motifs that might explain why varying fragment lengths would help (if they do). Sometimes I would turn off the paired-end module in Velvet and see N50 drop by anywhere from 1% to 5%.
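For anyone wanting to repeat that comparison, N50 is easy to compute from a list of contig lengths. A minimal sketch, using the usual definition (the contig length at which the running total, going from longest to shortest, first reaches half the assembly size); the function name and the toy lengths are just for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least half of the assembly
            return length
    raise ValueError("need at least one contig")


# Toy contig lengths; the total is 54, so N50 is reached at 10 + 9 + 8 = 27.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

Running it on the contig lengths from an assembly with and without pairing information gives the two numbers to compare.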