Length Of Read Needed To Confidently Map Sequence
11.3 years ago
Isaac Joseph ▴ 150

How many base pairs would I need to confidently map an arbitrary DNA sequence to a location in the human genome?

Mathematically, it seems like 16 bases should do the trick (since 4^16 ≈ 4.3 billion, which is more than the number of mapping locations in the human genome). Is this borne out in the real world? I know that repetitive sequences, etc. might complicate this analysis.
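As a rough sketch of that arithmetic, the snippet below computes how many positions would match one fixed k-mer by chance if the genome were a uniform random sequence. The genome size of about 3.1 billion bases is an assumed round figure, and the function names are made up for illustration. The random model ignores repeats and only counts one strand, so treat the output as an optimistic lower bound on the length needed.

```python
# Back-of-envelope check of the 4^k argument under a uniform random-sequence model.
# GENOME_SIZE is an assumed round figure (~3.1 billion bases), not a value from this thread.
GENOME_SIZE = 3_100_000_000


def expected_chance_hits(k: int, genome_size: int = GENOME_SIZE) -> float:
    """Expected number of positions matching one fixed k-mer by chance (single strand)."""
    return genome_size / 4 ** k


def min_length_for(max_expected_hits: float, genome_size: int = GENOME_SIZE) -> int:
    """Smallest k whose expected number of chance hits is at most max_expected_hits."""
    k = 1
    while expected_chance_hits(k, genome_size) > max_expected_hits:
        k += 1
    return k


if __name__ == "__main__":
    print("4^16 =", 4 ** 16)
    for k in (16, 18, 20, 22):
        print(k, expected_chance_hits(k))
    print("k for <= 0.01 expected chance hits:", min_length_for(0.01))
```

Under these assumptions a 16-mer still has a noticeable expected number of chance matches elsewhere, so "4^16 exceeds the genome size" is not the same as "a 16-mer is confidently unique", even before repeats are considered.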

alignment mapping • 8.3k views

The old rule (from when I used to design PCR primers by hand) was that 18-22 bases is a good length: it has a reasonable chance of being unique and gives an appropriate Tm for PCR.

11.2 years ago

By doing self-alignment, we can see exactly what percentage of the genome is uniquely mappable with different size reads. I've got these results lying around:

[Figure: percentage of the genome that is uniquely mappable at different read lengths]

To be clear, these are calculated by taking each possible read of a given length, mapping it back with BWA, and then determining whether it maps uniquely to the correct position. You can calculate these yourself in a pretty straightforward manner, or look at the "Mapability" track in UCSC to grab some pre-computed ones.
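For anyone who wants to see the idea on a toy scale, here is a minimal sketch. It is not the BWA-based procedure described above: it swaps the aligner for exact k-mer counting on the forward strand only, so it ignores mismatches, the reverse complement, and aligner heuristics. The function name `unique_fraction` is made up for illustration.

```python
from collections import Counter


def unique_fraction(genome: str, read_length: int) -> float:
    """Fraction of start positions whose read_length-mer occurs exactly once in genome.

    Exact forward-strand matching only; a stand-in for real self-alignment.
    """
    n_reads = len(genome) - read_length + 1
    if n_reads <= 0:
        raise ValueError("read_length must not exceed the genome length")
    # Count every overlapping substring of the requested length.
    counts = Counter(genome[i:i + read_length] for i in range(n_reads))
    # A k-mer seen exactly once corresponds to exactly one uniquely placed read.
    n_unique = sum(1 for c in counts.values() if c == 1)
    return n_unique / n_reads


if __name__ == "__main__":
    toy = "ACGTACGTTT"  # tiny made-up sequence containing a repeated ACGT
    for k in (2, 4, 5):
        print(k, unique_fraction(toy, k))
```

On the toy sequence the unique fraction rises as the reads get long enough to span the repeat, which is the same effect the full-genome numbers show. Holding every k-mer in memory like this only works for small sequences; for a whole genome you would want an aligner or the pre-computed tracks.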


With short-insert Illumina paired-end reads, you can reach somewhere around 82-85% for 2×35 bp reads and 94-95% for 2×100 bp reads.


Right. This calculation used single-end reads, so the numbers will be lower than what you can get from paired-end reads, using that extra information.

11.3 years ago

No genome resembles a random distribution of bases, so your formula does not apply directly.

The first high-throughput instruments had read lengths of 35 bases. I would take that as guidance for the length needed to get acceptable mapping rates.

11.2 years ago
mchaisso ▴ 160

I had to do a similar calculation for a recent paper. To break past the asymptotic line, the lengths of reads have to get much longer: http://www.biomedcentral.com/content/pdf/1471-2105-13-238.pdf (figure 6)
