Are there any studies comparing read lengths of 100, 150 and 250 bp (paired-end) for variant calling (SNP detection)? I was wondering how much we gain (true positives, false negatives) if we sequence longer reads with Illumina (not PacBio).
We compared single vs paired-end and 50bp vs 150bp, all at the same genome coverage (20X). We found that paired-end reads offered the most improvement, largely by increasing mappability/coverage within repeated gene sequences (paralogs, pseudogenes, conserved domains). We observed a more modest increase in coverage with longer reads. Details are available here.
Longer reads and longer/variable insert sizes are both helpful for resolving repetitive areas, but it's also worth noting that longer individual reads improve indel-calling capability and accuracy. For example, I was able to call insertions of up to ~40bp from 2x100bp reads (using the raw mapping data, i.e., looking for insertion events fully contained in CIGAR strings). However, after extending and merging pairs to produce fused reads >400bp long, I was able to confidently detect insertion events over 200bp after mapping. The number of short insertions dwarfs the number of longer insertions; as I noted in an email:
This yields approximately 48000 insertions (~2700 longer than 36bp and ~400 longer than 100bp)
And SNPs in turn outnumber insertions by maybe 20-to-1. So the gain from longer reads does not affect the majority of mutations, but then again, you can probably get the majority of mutations with 50bp single-end reads. If you're interested in long indels (and particularly long insertions), it's worth considering longer reads and extending+merging pairs with longer and variable insert sizes. Although the above was based on 2x100 reads, 2x150 would have worked even better; longer reads are easier to extend and merge. (By "easier" I mean they can be made longer, with greater accuracy.)
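To make "insertion events fully contained in CIGAR strings" concrete, here is a minimal Python sketch of that kind of check. It is not the script I actually used; the function names and the example CIGAR strings are made up for illustration, and it only looks at the CIGAR column of a SAM record (no filtering on mapping quality or anything else).

```python
import re

# One CIGAR operation: a run length followed by an operation code
# (the operation set defined in the SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def insertion_lengths(cigar):
    """Return the lengths of all insertion (I) operations in one CIGAR string."""
    return [int(length) for length, op in CIGAR_OP.findall(cigar) if op == "I"]


def count_insertions(cigars, min_len=1):
    """Count insertion events of at least min_len bases across many CIGAR strings."""
    return sum(
        1
        for cigar in cigars
        for length in insertion_lengths(cigar)
        if length >= min_len
    )


# Hypothetical CIGAR strings: two from 100bp reads, one from a merged ~460bp read,
# and one read with no insertion at all.
example_cigars = ["30M40I30M", "50M2I48M", "200M210I50M", "100M"]

print(count_insertions(example_cigars))               # all insertions
print(count_insertions(example_cigars, min_len=37))   # longer than 36bp
print(count_insertions(example_cigars, min_len=101))  # longer than 100bp
```

The point of the example: an insertion can only show up this way if the read also carries enough flanking sequence on both sides to anchor the alignment, so the longest insertion you can report is always well short of the read length. That is why the merged reads reach insertion sizes that 2x100 reads cannot.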