Probability Of Finding A Dna Sequence In A Window
4
9
Entering edit mode
9.6 years ago

Greetings,

A friend asked me what is the probability of observing a 7-mer in a 10kb genomic stretch. She assumes the 7-mer is biologically functional, however she wanted to know the probability that the 7-mer happend by chance.

I have two answers, but I don't know which is right. They might be the same just stated differently. I was hoping that someone with a strong probability background could help.

• 4 bases {A,T,G,C} - assuming equal probability for now.
• 4^7 combinations of the 4 bases.
• 1 / 4^7 of randomly observing any specific 7mer = 6.103516e-05
• The sequence would be seen ~ every 4^7 bases on average? = 16,384

alternatively I have found:

so the expected number of occurrences of that k-mer is N/4^k,

• 10kb / 4 ^ 7 = 0.6103516

These are very different answers. Can anyone provide some clarity?

Thanks

dna primer • 33k views
0
Entering edit mode

I got the same problem but little bit vary. suppose I have a motif sequence as GATAAAG. sequence length is 1000 bp. this is a 7-mer. so as you descripe i can get how many expected occurrence of such 7-mer within 1000bp. this includes all possible combination of expected counts. but if I need to count expected value of exact same order of sequence as mention above how do we modify this calculation?

9
Entering edit mode
9.6 years ago

I find it easier to work things out when looking for the reverse problem - what is the chance of not seeing an event.

That way the probabilities can be multiplied, whereas if you are looking for matches then you have to account for seeing things once, twice etc.

If the probability of not generating a given kmer is (1-p) then the chance of not seeing it after giving it N tries is (1-p)^N (double the N if you take the sequence to be two stranded)

(1-1/(4^7))^(2*10^4)
.29501166548155827197


This is the chance of not seeing the kmer at all.

2
Entering edit mode

Or if, like me, you always have an R session open pbinom(q=0, size=2e4, prob=1/4^7) :)

1
Entering edit mode

although come to think of it the reverse strand is not random so multiplying by two is almost certainly incorrect - the real value would be between the N and 2N

1
Entering edit mode

.295 ~~ 1 - 0.6103516.... :).

0
Entering edit mode

as mention above, this gives all possible k-mer occrrence for a given sequence. but if I need to find only the given order of motif such as GATAAAG( i dont need all combinations of k-mer) how do I do this?

0
Entering edit mode

Has your problem solved?I've had the same problem. Besides,what should I do for the same sequence on the antichain? Thank you in advance~

3
Entering edit mode
9.6 years ago
David W 4.9k
The sequence would be seen ~ every 4^7 bases on average? = 16,384

so the expected number of occurrences of that k-mer is N/4^k = 10kb / 4 ^ 7 = 0.6103516


These are very different answers. Can anyone provide some clarity?

They are answers to different questions, and neither of those are the question you want to ask :)

• In the first you calculate the expected "waiting time" (i.e. the numer of 7-mer windows you'd expect to look at before you found a match)

• In the second you calculate the expected _number_ of times the specific kmer will occur in a 10k window

• The probability of observing the k-mer at least once is in a 10kb window described in Istvan's answer.

2
Entering edit mode
9.6 years ago

I recently had a very similar question with a motif that I was working on, and I took two approaches, one analytical and one empirical way:

First approach: I did almost what you suggested. In addition, I also took the base composition of the scanned sequences into account. E.g, if there are 30% As, the probability to see an A at a specific position is 0.3. The probability of a 7-mer like AATGCCA would be:

(prob. of A) ^ 3 * (prob. of C) ^ 2 * (prob. of T) ^ 1 * (prob. of G) ^ 1

This probability times the number of scanned windows (10000 – 6 for 10kB) gives the number of occurrences that we would expect just by chance.

What I did not like about this approach is that I assume the scanned windows to be independent from each other, which they are not, because each window overlaps by 6 bases with the neighboring windows.

Second approach: I shuffled the nucleotides of the scanned sequence many times, each time scanning the randomly generated sequence and counting the occurrences of my motif. On average, this gave me exactly the result that I predicted with my first approach. I guess that the dependence of the windows is not of practical relevance for long sequences.

In addition, one could calculate something like a p-value for the observed occurrences, asking: What is the probability to find the motif at least as often as we did just by chance. In terms of my first approach, one would use a Poisson distribution to do this, using the calculated probability of the k-mer and the number of scanned windows as n. In terms of the second approach, one would just see in how many cases the motif was found as often as actually observed or more in the shuffled sequences. For example: You found the motif twice in your scanned sequence. After shuffling the sequence 1000 times, in 30 cases you found the motif twice or even more often than twice. This would give you a p-value of 0.03.

2
Entering edit mode
9.6 years ago
Lee Baker ▴ 30

Here's how I think about it:

• If the target you are looking for it two-stranded, that gives 2 possible 7-mers that can match any specific location (or equivalently, 2 possible locations that a 7-mer can match).
• As you said, there are 4^7 = 16384 possible 7-mers, giving a 2 out of 16384 chance of seeing a specific 7-mer at each location.
• In a 10000kb region, there are 10000 - (7-1) = 9994 possible positions for a kmer.
• The probability of seeing it at least once is the sum of the probability of seeing it once, plus the probability of seeing it twice, plus the probability of seeing it three times, etc; it is simpler to calculate the probability of not seeing it, so let's do that.

So, combining the above:

• The chance that the kmer isn't at one specific location is 16382/16384, or about 99.99%.
• Apply this probability across all locations: (16382/16384)^9994, the probability that the 7-mer does not exist in this region is 29.52%, and 70.48% that it exists one or more times, which agrees with Istvan's answer above.