Question: What Does 2x250 bp Buy Us?
Jeremy Leipzig (Philadelphia, PA) wrote 4.3 years ago:

From the Illumina press release:

HiSeq 2500 chemistry enhancements empower the industry’s highest daily throughput and drive down the price of whole-genome sequencing. With support of paired 250 base pair read lengths in rapid run mode, the HiSeq 2500 will be capable of generating up to 300 Gb in rapid mode with sample to data in less than three days. These enhancements will be available in the second half of this year.

What will these much longer read lengths enable in terms of improved analysis?

Dan Gaston (Canada) wrote 4.3 years ago:

I think longer reads in general, as long as there isn't an associated increase in error rates, are ultimately good. Even in the human genome, there are still many regions that are hard to map reliably because of repetitive elements and low sequence complexity. Longer reads can help with those areas (although the reference itself is of course still problematic there). And any overlap between pairs should, theoretically, help with error correction to an extent.


Many repeats are still mappable given paired-end reads. With the standard 2x100 bp, about 94-95% of the human genome is callable, so 2x250 bp may not give you a big improvement. De novo assembly will greatly benefit from longer reads. Although in theory we can also use PE reads to assemble through Alu, in practice few (if any) assemblers really work this out. Overlapping PE reads from ~400 bp fragments are much preferred for de novo assembly.

Reply written 4.3 years ago by lh3

I agree that the improvements will only be slight for what I listed. I would argue, though, that a decent portion of that callable percentage can still be problematic. We see this routinely, even with targeted exome data. 250 bp reads probably won't improve things much, but they may help slightly in some of these regions. I see the longer reads being of far more use for RNA-Seq and de novo assembly, though.

Reply written 4.3 years ago by Dan Gaston

Matt Shirley (Cambridge, MA) wrote 4.3 years ago:

As long as the 3' base quality stays high enough to use close to 250bp, it seems you would have less ambiguous placement of split reads for RNA-seq. It also seems like you could select for ~300bp fragment sizes in your library and develop methods to detect base miscalls vs. PCR errors using even the lower quality 3' overlapping sequence. I'm not sure if in-read indel detection is restricted by discovery (number of reads containing mappable sequence flanking an indel) or whether it is computationally prohibitive to consider the gapped alignments.
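
To make the overlap idea concrete, here is a minimal Python sketch. It assumes a ~300 bp fragment sequenced from both ends with 250 bp reads, so the mates share 200 bp. The function names and the toy fragment are hypothetical, and qualities, indels and adapters are ignored. The working assumption is that a base miscall shows up as a disagreement between the two mates, whereas a PCR error sits in the template itself and would be read identically by both mates.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_disagreements(read1, read2, fragment_len):
    """Fragment positions where the two mates disagree inside their overlap.

    read1 is the forward mate, read2 the reverse mate as sequenced.
    Assumes no indels and that fragment_len is known (hypothetical setup).
    """
    mate = revcomp(read2)               # put read 2 on the forward strand
    start2 = fragment_len - len(mate)   # fragment coordinate where the mate begins
    overlap_start = max(0, start2)
    overlap_end = min(len(read1), fragment_len)
    return [pos for pos in range(overlap_start, overlap_end)
            if read1[pos] != mate[pos - start2]]


# Toy 300 bp fragment and an error-free 2x250 pair.
fragment = ("ACGTTGCA" * 38)[:300]
read1 = fragment[:250]
read2 = revcomp(fragment[-250:])
print(overlap_disagreements(read1, read2, 300))

# Simulate a miscall near the low-quality 3' end of read 1.
bad = "C" if read1[240] != "C" else "A"
read1_err = read1[:240] + bad + read1[241:]
print(overlap_disagreements(read1_err, read2, 300))
```

In this toy case the clean pair reports no disagreements and the altered pair reports position 240 only. A real method would weigh the base qualities of both mates before deciding which call to trust.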

Istvan Albert (University Park, USA) wrote 4.3 years ago:

It just occurred to me (and perhaps this is common knowledge) that when detecting structural rearrangements, longer read lengths can have detrimental effects.

Say you had a 100 bp translocation. You could easily identify it with 50 bp paired-end reads showing an unexpected insert size, because they would still map inside the region. But 250 bp reads would span the entire region, so none of them would map anymore, leaving a hole at both locations. That is less information than before.
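
A back-of-the-envelope Python sketch of that argument: it counts the start positions at which a read of a given length fits entirely inside a translocated segment. The function name is hypothetical, and the model assumes an aligner that only places reads end to end. Real aligners can soft-clip or report partial alignments, so this is a worst case.

```python
def reads_fully_inside(segment_len, read_len):
    """Number of start positions where a read lies entirely within the segment."""
    return max(0, segment_len - read_len + 1)


# 100 bp translocated segment, three read lengths.
for read_len in (50, 100, 250):
    print(read_len, reads_fully_inside(100, read_len))
```

With a 100 bp segment, 50 bp reads have 51 possible placements, 100 bp reads have exactly one, and 250 bp reads have none.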


I think what this really means is that, instead of using mate-pair mapping to detect translocations, we would need to move toward a split-read approach. For DNA sequencing, all of the unmapped reads could potentially span translocation junctions and could be mapped using multiple random seeds from each read. If you have seeds that map and can be expanded such that two distantly mapped seeds uniquely expand to encompass two halves of the entire read, then you can still directly detect the translocation.
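
A toy Python sketch of the seed-and-extend idea, under loud assumptions: it uses exact matching only, takes one seed from each end of the read rather than multiple random seeds, and runs on a random in-memory reference. The function names, the seed length `k=12` and the simulated junction are all hypothetical and not taken from any existing tool.

```python
import random


def find_unique(ref, seed):
    """Return the only position of seed in ref, or None if absent or repeated."""
    first = ref.find(seed)
    if first == -1 or ref.find(seed, first + 1) != -1:
        return None
    return first


def split_read_junction(ref, read, k=12):
    """Toy split-read search: anchor both ends of the read, then extend inward.

    Returns (left_start, left_len, right_start, right_len) or None.
    """
    left = find_unique(ref, read[:k])
    right = find_unique(ref, read[-k:])
    if left is None or right is None:
        return None
    # Extend the left anchor to the right while read and reference agree.
    n = k
    while n < len(read) and left + n < len(ref) and ref[left + n] == read[n]:
        n += 1
    if n == len(read):
        return None  # the whole read maps contiguously, so there is no junction
    # Extend the right anchor to the left while read and reference agree.
    right_end = right + k  # one past the last reference base of the right anchor
    m = k
    while m < len(read) and right_end - m > 0 and ref[right_end - m - 1] == read[-m - 1]:
        m += 1
    if n + m < len(read):
        return None  # the two extensions do not explain the entire read
    right_len = len(read) - n
    return left, n, right_end - right_len, right_len


# Simulated data: a random reference and a 120 bp read joining two distant loci.
rng = random.Random(0)
ref = "".join(rng.choice("ACGT") for _ in range(2000))
read = ref[100:160] + ref[1500:1560]
print(split_read_junction(ref, read))
```

The returned tuple gives the two reference segments that together reproduce the read. The exact split point can shift by a base or two when the sequences on either side of the junction happen to share bases, which is the usual microhomology ambiguity.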

Reply written 4.3 years ago by Matt Shirley

Or perhaps pursue assembly: once the reads are sufficiently long, this becomes more reasonable.

Further musings: since the insert sizes don't seem to be growing much (if at all), we may be evolving toward a situation where the reads run into one another and start to overlap, perhaps even fully. That's another characteristic that could be ripe for new techniques.

Reply written 4.3 years ago by Istvan Albert

@matt shirley: +1 for your comment on split-read analysis methods. Even with 100 bp reads, split-read approaches are practical; 250 bp reads most likely make them necessary.

Reply written 4.3 years ago by Sean Davis

Jeremy Leipzig (Philadelphia, PA) wrote 4.3 years ago:

The 16S rRNA V1-V3 region is about 430 bp, so 250 bp pairs will cover it with a 70 bp overlap.


Yes. Same thing for the 18S rRNA. The often-targeted V4 region is about 400 bp and should be covered by 250 bp pairs. But I am worried about the taxa with longer V4 regions: they might disappear from our radars if we switch from 454 to Illumina.
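
The amplicon numbers in this thread follow from simple arithmetic, sketched below in Python. The function name is hypothetical, the 550 bp region is an invented example of a "longer V4", and the formula assumes the amplicon is at least as long as one read and that primers and adapters are ignored.

```python
def mate_overlap(read_len, amplicon_len):
    """Overlap in bp between the two mates of a pair on one amplicon.

    A negative value is the size of the unsequenced gap in the middle.
    Assumes amplicon_len >= read_len and ignores primers and adapters.
    """
    return 2 * read_len - amplicon_len


print(mate_overlap(250, 430))  # 16S V1-V3, about 430 bp
print(mate_overlap(250, 400))  # 18S V4, about 400 bp
print(mate_overlap(250, 550))  # hypothetical taxon with a longer V4
```

This gives 70 bp and 100 bp of overlap for the two regions mentioned, and a 50 bp gap for the hypothetical 550 bp region. A pair with a gap cannot be merged into one contiguous sequence, which is one way such taxa could drop out of an analysis that requires merged reads.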

Reply written 4.3 years ago by Frédéric Mahé

And for Alu, the hardest part in assembly.

Reply written 4.3 years ago by lh3

Phis (CH) wrote 4.3 years ago:

Well, this is just stating the obvious, but for anyone working with non-model organisms, improved read lengths are a VERY welcome thing.
