Length Of Read Needed To Confidently Map Sequence
3
2
11.6 years ago
Isaac Joseph ▴ 170

How many base pairs would I need to confidently map an arbitrary DNA sequence to a location in the human genome?

Mathematically, it seems like 16 bases should do the trick (since 4^16 ≈ 4.3 billion, which is more than the number of mapping locations in the human genome). Is this borne out in the real world? I know that repetitive sequences, etc. might complicate this analysis.
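The back-of-the-envelope estimate above can be checked with a few lines of Python. This is only a sketch under the idealized assumption that the genome is a uniformly random string of bases; the genome size of 3.1 billion and the function names are assumptions introduced here for illustration, not something from the thread.

```python
# Sketch of the random-sequence estimate. Assumes a uniformly random genome,
# which real genomes are not, so this is a lower bound on the length needed.

GENOME_SIZE = 3_100_000_000  # assumed approximate haploid human genome size


def min_unique_length(genome_size: int, alphabet: int = 4) -> int:
    """Smallest k such that alphabet**k is at least genome_size."""
    k = 0
    while alphabet ** k < genome_size:
        k += 1
    return k


def expected_chance_hits(k: int, genome_size: int, both_strands: bool = True) -> float:
    """Expected number of chance occurrences of one fixed k-mer in a random genome."""
    positions = genome_size - k + 1
    strands = 2 if both_strands else 1
    return strands * positions / 4 ** k


if __name__ == "__main__":
    print("minimum k:", min_unique_length(GENOME_SIZE))
    for k in (16, 20, 25):
        print(k, expected_chance_hits(k, GENOME_SIZE))
```

Even under this random model, a 16-mer is expected to turn up by chance more than once when both strands are searched, so a few extra bases are needed before a chance match becomes unlikely.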

alignment mapping • 8.5k views
1

The old rule (when I used to design PCR primers by hand) was that 18-22 bases was a good length, as it had a reasonable chance of being unique as well as an appropriate Tm for PCR.
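For a rough feel of the Tm side of that rule, here is a small sketch using the Wallace rule (Tm ≈ 2 °C per A/T plus 4 °C per G/C), a common hand approximation for short oligos. Whether the commenter used this particular rule is an assumption, and the example primer is made up.

```python
def wallace_tm(primer: str) -> int:
    """Approximate melting temperature in degrees C via the Wallace rule.

    Only a rough hand estimate for short oligos; it ignores salt and
    primer concentration.
    """
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("primer must contain only A, C, G, T")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    example = "ATGCATGCATGCATGCATGC"  # hypothetical 20-mer with 50% GC
    print(len(example), wallace_tm(example))
```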

6
11.6 years ago

By doing self-alignment, we can see exactly what percentage of the genome is uniquely mappable with different size reads. I've got these results lying around:

To be clear, these are calculated by taking each possible read of a given length, mapping it back with BWA, and then determining whether it is mapped uniquely to the correct position. You can calculate these yourself in a pretty straightforward manner, or look at the "Mappability" track in UCSC to grab some pre-computed ones.
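The idea can be sketched on a toy sequence. Note that this is not the BWA-based procedure described above: it swaps alignment for exact k-mer counting on the forward strand only, so it ignores mismatches and reverse-complement hits. The function name and the toy sequence are made up for illustration.

```python
from collections import Counter


def unique_fraction(genome: str, k: int) -> float:
    """Fraction of start positions whose k-mer occurs exactly once.

    Exact matching on the forward strand only; a stand-in for mapping
    each read back with an aligner.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)


TOY_GENOME = "AAAACCCCAAAAGGGG"  # hypothetical sequence with a repeated AAAA

if __name__ == "__main__":
    for k in (2, 4, 5):
        print(k, round(unique_fraction(TOY_GENOME, k), 3))
```

On the toy sequence the unique fraction climbs as k grows past the length of the repeat, which is the same trend the real calculation shows. Holding every k-mer in memory like this would not scale to a whole human genome, so treat it as an illustration only.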

1

With short-insert Illumina paired-end reads, you can reach somewhere around 82-85% for 2*35bp reads and 94-95% for 2*100bp.

2

Right. This calculation used single-end reads, so the numbers will be lower than what you can get from paired-end reads, using that extra information.

4
11.6 years ago

No genome resembles a random distribution of bases, so your formula does not apply directly.

The first high-throughput instruments had read lengths of 35 bp. I would take that as guidance for the length needed to get acceptable mapping rates.

3
11.6 years ago
mchaisso ▴ 160

I had to do a similar calculation for a recent paper. To break past the asymptotic line, the lengths of reads have to get much longer: http://www.biomedcentral.com/content/pdf/1471-2105-13-238.pdf (figure 6)