Question

probability of mapping a random string of four nucleotides in a genome


9.9 years ago

Ric ▴ 440

Hello,

What is the probability of finding a random string of four nucleotides in a reference genome of length N?
What is the probability of uniquely mapping a random string of four nucleotides to a reference genome of length N?
How does paired-end data improve the probability of uniquely mapping reads?

Thank you in advance.

mapping probability ngs dna • 2.5k views

updated 3.6 years ago by Ram 45k • written 9.9 years ago by Ric ▴ 440

Ram · Accepted Answer · 2015-08-27

Hi,

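I'm no statistician, but the chance is probably very high. There are 4^4 = 256 possible 4-mers, so if all four bases are equally likely and independent, your random string matches any given position with probability 1 in 256. Thinking in terms of k-mers, a genome of length N contains N-3 overlapping 4-mers, so you expect about (N-3)/256 occurrences of your string. Note that this is an expected count, not a probability: the chance of at least one hit is roughly 1 - (255/256)^(N-3), which is essentially 1 for any real genome.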
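To put rough numbers on this, here is a minimal Python sketch. It assumes a uniform random genome (each base 25%, positions independent) and ignores the dependence between overlapping windows, so treat it as a back-of-the-envelope estimate. The function names `expected_hits` and `prob_at_least_one` are made up for illustration.

```python
import math

def expected_hits(n, k=4):
    """Expected number of exact matches of a random k-mer in a genome of length n."""
    positions = max(n - k + 1, 0)  # number of overlapping k-mer windows
    return positions / 4 ** k

def prob_at_least_one(n, k=4):
    """Approximate P(at least one match), treating windows as independent."""
    positions = max(n - k + 1, 0)
    p = 4.0 ** -k  # chance of a match at one position
    # 1 - (1 - p)^positions, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(positions * math.log1p(-p))

for n in (100, 1_000, 1_000_000):
    print(n, expected_hits(n), prob_at_least_one(n))
```

The expected count keeps growing with N, while the probability saturates at 1, which is why the two should not be confused.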
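Mapping uniquely means exactly one hit. With a 1 in 256 chance per position and N-3 positions, that is roughly (N-3) × (1/256) × (255/256)^(N-4) if you treat positions as independent. This peaks at about 0.37 when N is around 259 and then drops to practically zero, because in a genome of any real size the string will almost definitely map to multiple locations.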
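A small sketch of the "exactly one hit" case, under the same simplifying assumptions as the rest of this guesstimate (uniform random genome, positions treated as independent, so the number of hits is binomial). `prob_exactly_one` is a hypothetical helper name, not something from a mapping tool.

```python
import math

def prob_exactly_one(n, k=4):
    """Approximate P(exactly one match) of a random k-mer in a genome of length n."""
    positions = max(n - k + 1, 0)  # number of overlapping k-mer windows
    if positions == 0:
        return 0.0
    p = 4.0 ** -k  # chance of a match at one position
    # Binomial: positions * p * (1 - p)^(positions - 1)
    return positions * p * math.exp((positions - 1) * math.log1p(-p))

for n in (4, 259, 10_000, 1_000_000):
    print(n, prob_exactly_one(n))
```

For a 4-mer the value is largest when the genome is only a few hundred bases long and is effectively zero for anything genome-sized, so it never goes negative; it just vanishes.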
Reads are normally at least 100 bp (although there are sequencers that go down to 28 bp), so the number of possible read sequences is in the range of 4^100 ≈ 1.6E+60, and the chance of a random match at any one position is about 1 in 1.6E+60. The reverse read will not increase the probability of the first read mapping uniquely. However, you now have two paired reads that should map close together (depending on your insert size) and should each map only once. So it does not increase the probability of the first read mapping uniquely, but it increases the confidence with which you can say it mapped uniquely.
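To see how quickly read length kills chance matches, here is a rough sketch under the same uniform random model. The 3.1 Gb genome length is an assumption (roughly human-sized) chosen only for illustration, and both function names are hypothetical.

```python
def expected_chance_hits(genome_length, read_length):
    """Expected number of purely random exact matches of a read in the genome."""
    positions = max(genome_length - read_length + 1, 0)
    return positions * 4.0 ** -read_length

def min_length_for_under_one_hit(genome_length):
    """Smallest read length whose expected number of chance hits is below 1."""
    k = 1
    while expected_chance_hits(genome_length, k) >= 1.0:
        k += 1
    return k

GENOME_LENGTH = 3_100_000_000  # assumption: roughly human-sized genome

print(4 ** 100)                                  # number of possible 100-mers
print(expected_chance_hits(GENOME_LENGTH, 4))    # 4-mer: millions of chance hits
print(expected_chance_hits(GENOME_LENGTH, 100))  # 100-mer: effectively zero
print(min_length_for_under_one_hit(GENOME_LENGTH))
```

In a real genome, repeats rather than chance are what cause multi-mapping of long reads, so this only bounds the random component.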

Like I said, this is just a super basic guesstimate. A lot of factors play a real role: if your random 4-letter string contains more GCs, that will change the probability of mapping, and repeats in the genome will affect the mapping probability too, etc.