HiSeq 2500 chemistry enhancements empower the industry’s highest daily throughput and drive down the price of whole-genome sequencing. With support of paired 250 base pair read lengths in rapid run mode, the HiSeq 2500 will be capable of generating up to 300 Gb in rapid mode with sample to data in less than three days. These enhancements will be available in the second half of this year.
What will these much longer read lengths enable in terms of improved analysis?
I think longer reads in general, as long as there isn't an associated increase in error rates, are ultimately good. Even with just the human genome there are still many regions that are hard to map reliably because of repetitive elements and low sequence complexity. Longer reads can help with those areas (although the reference itself is of course still problematic there). And any overlap between pairs should, theoretically, help with error correction to an extent.
As long as the 3' base quality stays high enough to use close to 250bp, it seems you would have less ambiguous placement of split reads for RNA-seq. It also seems like you could select for ~300bp fragment sizes in your library and develop methods to detect base miscalls vs. PCR errors using even the lower quality 3' overlapping sequence. I'm not sure if in-read indel detection is restricted by discovery (number of reads containing mappable sequence flanking an indel) or whether it is computationally prohibitive to consider the gapped alignments.
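To make the overlap idea concrete, here is a rough Python sketch of how much two mates overlap for a given fragment size, and how disagreements inside that overlap could be flagged. The function names are made up for illustration, and it assumes both reads are no longer than the fragment and that read 2 has already been reverse-complemented onto the forward strand. A position where the two mates disagree is a candidate sequencer miscall; a position where both agree with each other but differ from the reference is more likely to be in the fragment itself (a PCR error or a real variant).

```python
def mate_overlap(fragment_len, read_len):
    """Number of fragment bases covered by both mates of a pair."""
    # A read cannot cover more of the fragment than the fragment's length.
    usable = min(read_len, fragment_len)
    return max(0, 2 * usable - fragment_len)


def overlap_mismatches(read1, read2_fwd, fragment_len):
    """List (fragment_pos, base_in_read1, base_in_read2) where the mates disagree.

    Assumption: read1 starts at fragment position 0, read2_fwd ends at the
    last fragment position, and neither read is longer than the fragment.
    """
    start = max(0, fragment_len - len(read2_fwd))  # where read 2 begins
    end = min(len(read1), fragment_len)            # where read 1 stops
    mismatches = []
    for pos in range(start, end):
        b1 = read1[pos]
        b2 = read2_fwd[pos - start]
        if b1 != b2:
            mismatches.append((pos, b1, b2))
    return mismatches


# ~300 bp fragments with 2 x 250 bp reads, versus 2 x 100 bp reads
print(mate_overlap(300, 250))
print(mate_overlap(300, 100))
```

With ~300 bp fragments and 2 x 250 bp reads, most of each fragment is read twice, which is what would make this kind of check possible even when the 3' quality drops off.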
It just occurred to me (and perhaps this is common knowledge) that when detecting structural rearrangements, longer read lengths can have detrimental effects.
Say you had a 100 bp translocation. You could easily identify it with 50 bp paired-end reads from their unexpected insert size, because the reads would still map inside the region. But 250 bp reads would cover the entire region, so none of them would map anymore, and that would leave a hole at both locations. That is less information than before.
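The arithmetic behind this can be sketched in a few lines of Python. This is only a toy count, and it assumes an aligner that needs the whole read to match end to end (no soft clipping or split alignment); the function names are hypothetical. It counts how many read start positions lie entirely inside the moved segment versus how many cross one or both breakpoints.

```python
def contained_starts(segment_len, read_len):
    """Read start positions for which the read lies entirely inside the segment."""
    return max(0, segment_len - read_len + 1)


def breakpoint_crossing_starts(segment_len, read_len):
    """Read start positions that touch the segment but cross a breakpoint."""
    # Any read overlapping the segment by at least one base.
    touching = segment_len + read_len - 1
    return touching - contained_starts(segment_len, read_len)


for read_len in (50, 250):
    inside = contained_starts(100, read_len)
    crossing = breakpoint_crossing_starts(100, read_len)
    print(read_len, inside, crossing)
```

Under that end-to-end assumption, a 100 bp segment leaves room for 50 bp reads to sit fully inside it, while every 250 bp read touching the segment has to cross a breakpoint.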