5
3
8.8 years ago

From the Illumina press release

HiSeq 2500 chemistry enhancements empower the industry’s highest daily throughput and drive down the price of whole-genome sequencing. With support of paired 250 base pair read lengths in rapid run mode, the HiSeq 2500 will be capable of generating up to 300 Gb in rapid mode with sample to data in less than three days. These enhancements will be available in the second half of this year.


What will these much longer read lengths enable in terms of improved analysis?

illumina • 3.1k views
4
8.8 years ago
DG 7.2k

I think longer reads in general, as long as there isn't an associated increase in error rates, are ultimately good. Even with just the human genome there are still many regions where reliable mapping is hard because of repetitive elements and low sequence complexity. Longer reads can help with those areas (although the reference itself is of course still problematic there). And any overlap between pairs should, theoretically, help with error correction to an extent.

3

Many repeats are still mappable given paired-end reads. With the standard 2*100bp, about 94-95% of the human genome is callable, so 2*250bp may not give you a big improvement. De novo assembly will greatly benefit from longer reads. Although in theory we can also use PE reads to assemble through Alu, in practice few (if any) assemblers really work this out. Overlapping PE reads that together span ~400bp are much preferred for de novo assembly.

1

I agree that there will only be slight improvements for what I listed. I would argue that a decent portion of that callable percentage can still be problematic. We see this routinely even when working with targeted exome data. 250 bp reads probably won't improve much, but they may help slightly in some of these regions. I see the longer reads being of far more use for RNA-Seq and de novo assembly though.

3
8.8 years ago

As long as the 3' base quality stays high enough to use close to 250bp, it seems you would have less ambiguous placement of split reads for RNA-seq. It also seems like you could select for ~300bp fragment sizes in your library and develop methods to detect base miscalls vs. PCR errors using even the lower quality 3' overlapping sequence. I'm not sure if in-read indel detection is restricted by discovery (number of reads containing mappable sequence flanking an indel) or whether it is computationally prohibitive to consider the gapped alignments.
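To make the overlap idea concrete: with ~300bp fragments and 2*250bp reads, the two mates would read the same ~200 bases twice. A change that was already in the fragment (a real variant or a PCR error) should show up in both mates, while a base miscall would usually show up in only one. The Python sketch below is only an illustration under simplifying assumptions: the fragment length is taken as known, neither read is longer than the fragment, there are no indels, and the function names are made up for this example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def reverse_complement(seq):
    """Reverse-complement a DNA string (upper-case A/C/G/T/N only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def overlap_length(fragment_len, read_len):
    """Bases seen by both mates; 0 if the mates do not meet."""
    return max(0, 2 * min(read_len, fragment_len) - fragment_len)

def reconcile_mates(read1, qual1, read2, qual2, fragment_len):
    """Compare the two mates over their shared bases.

    Returns (consensus, disagreements): the consensus of the overlap and the
    overlap positions where the mates differ. At a disagreement the base with
    the higher quality wins (hypothetical tie rule: mate 1 wins ties).
    """
    # Put mate 2 on the same strand as mate 1; its qualities reverse with it.
    mate2 = reverse_complement(read2)
    mate2_qual = list(reversed(qual2))
    start = fragment_len - len(read2)  # where mate 2 begins, in mate 1 coordinates
    shared = len(read1) - start        # number of bases covered by both mates
    consensus, disagreements = [], []
    for i in range(max(0, shared)):
        b1, q1 = read1[start + i], qual1[start + i]
        b2, q2 = mate2[i], mate2_qual[i]
        if b1 != b2:
            disagreements.append(i)    # candidate miscall in one of the mates
        consensus.append(b1 if q1 >= q2 else b2)
    return "".join(consensus), disagreements

# 300bp fragments with 2*250bp reads leave 200 doubly-read bases.
print(overlap_length(300, 250))
```

Positions that end up in `disagreements` are candidates for sequencing miscalls; a mismatch against the reference that both mates agree on cannot be explained that way.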

3
8.8 years ago

It just occurred to me (and perhaps this is common knowledge) that when detecting structural rearrangements, longer read lengths can have detrimental effects.

Say you had a 100 bp translocation. With 50 bp paired-end reads you could easily identify it from pairs with an unexpected insert size, because the reads still map inside the region. With 250 bp reads, every read that touches the region extends past it, so none of them map anymore, and that leaves a hole at both locations. That is less information than before.

3

I think what this really means is that, instead of mate-pair mapping used to detect translocations, we would need to move toward a split read approach. For DNA sequencing all of the unmapped reads could potentially span translocation junctions and could be mapped using multiple random seeds from each read. If you have seeds that map and can be expanded such that two distantly mapped seeds uniquely expand to encompass two halves of the entire read, then you can still directly detect the translocation.
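A toy version of that seed-and-extend idea, in Python, might look like the sketch below. It is deliberately simplified and is not how any particular aligner works: it uses one seed from each end of the read instead of multiple random seeds, requires exact matches, ignores the reverse strand, and treats the reference as a plain string. The function names and the seed length `k` are assumptions made for this example.

```python
def unique_hit(reference, seed):
    """Position of seed in reference if it occurs exactly once, else None."""
    first = reference.find(seed)
    if first == -1 or reference.find(seed, first + 1) != -1:
        return None
    return first

def split_read(reference, read, k=8):
    """Try to explain an unmapped read as two pieces from distant loci.

    Returns None if the read aligns contiguously or cannot be fully covered
    by two exact pieces; otherwise returns the breakpoint coordinates.
    """
    left_pos = unique_hit(reference, read[:k])
    right_pos = unique_hit(reference, read[-k:])
    if left_pos is None or right_pos is None:
        return None

    # Extend the left seed to the right for as long as read and reference agree.
    left_len = k
    while (left_len < len(read)
           and left_pos + left_len < len(reference)
           and reference[left_pos + left_len] == read[left_len]):
        left_len += 1
    if left_len == len(read):
        return None  # the whole read maps in one piece, no junction

    # Extend the right seed to the left in the same way.
    right_end = right_pos + k  # exclusive end of the right seed in the reference
    right_len = k
    while (right_len < len(read)
           and right_end - right_len - 1 >= 0
           and reference[right_end - right_len - 1] == read[len(read) - right_len - 1]):
        right_len += 1

    if left_len + right_len < len(read):
        return None  # the two halves do not account for the entire read

    return {
        "read_breakpoint": left_len,                 # first read base of the right piece
        "left_ref_end": left_pos + left_len,         # exclusive end of the left piece
        "right_ref_start": right_end - (len(read) - left_len),
    }
```

When the two extensions overlap in the read (microhomology at the junction), this sketch simply places the breakpoint at the end of the left extension; a real caller would have to report that ambiguity.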

0

Or perhaps pursuing assembly: once the reads are sufficiently long this becomes more reasonable.

Further musings: since the insert sizes don't seem to be growing much (if at all), we may be evolving towards a situation where the reads run into one another and start to overlap, perhaps even fully. That's another characteristic that could be ripe for new techniques.

2
8.8 years ago

The 16S rRNA V1-V3 region is about 430bp, so 250bp pairs with a 70bp overlap will cover that region.
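The arithmetic is simply 2 * 250 - 70 = 430. As a small Python sketch (the function names are made up for this example, and primers and quality trimming are ignored):

```python
def merged_span(read_len, overlap):
    """Length covered by a read pair whose mates overlap by `overlap` bases."""
    return 2 * read_len - overlap

def overlap_for_amplicon(read_len, amplicon_len):
    """Overlap left between the mates for a given amplicon length.

    A negative value means the mates do not meet and leave a gap.
    """
    return 2 * read_len - amplicon_len

print(merged_span(250, 70))            # span of 2*250bp reads with 70bp overlap
print(overlap_for_amplicon(250, 430))  # overlap left on a ~430bp amplicon
```

Any trimming of low-quality 3' ends comes straight out of that overlap, so the usable margin in practice is smaller than the nominal figure.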

0

Yes. Same thing for the 18S rRNA. The often-targeted V4 region is about 400 bp and should be covered by 250 bp pairs. But I am worried about the taxa with longer V4 regions: they might disappear from our radar if we switch from 454 to Illumina.

1

And for Alu, the hardest part in assembly.

1
8.8 years ago
Phis ★ 1.1k

Well, this is just stating the obvious, but for anyone working with non-model organisms, improved read lengths are a VERY welcome thing.