Question: Expected Value Of Finding A Subsequence In A Sequence Of Length L
6.6 years ago by
Tim10 wrote:


I am looking for the expected number of occurrences (frequency) of a subsequence within a sequence of length L.

For example, assuming that all nucleotides are equally probable, how many times would we expect to find the pattern 'ATTG' in a sequence of length 20?

I tried looking in Biostrings and IRanges (Bioconductor), but didn't find what I was looking for.

Many thanks!

sequence statistics • 2.6k views
6.6 years ago by
Salt Lake City, UT
brentp23k wrote:

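Assuming independent, equally probable bases, the probability that the exact 4-base subsequence 'ATTG' starts at any given position is 0.25 ^ 4. In a 20-base sequence, a 4-base subsequence can start at 20 - 4 + 1 = 17 positions. Expectations add across positions even though overlapping windows are not independent, so the expected number of occurrences is 0.25 ^ 4 * 17 == 0.06640625.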

You can make this generic:

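def prob(seq_len, subseq_len):
    # Returns the expected number of occurrences (not a probability) of one
    # specific subsequence, assuming independent, equally likely bases.
    if subseq_len > seq_len:
        return 0.0
    places = seq_len - subseq_len + 1  # possible start positions
    p_match = 0.25 ** subseq_len       # chance of a match at one position
    return p_match * places

print(prob(20, 4))  # 0.06640625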

Translating to R is up to you.
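If you want to convince yourself of the formula, here is a rough sketch of two checks: an exhaustive average over all sequences of a short length, and a random simulation for length 20. The helper names (`count_occurrences`, `exact_mean`, `simulated_mean`), the trial count and the seed are made up for illustration and are not from any library; overlapping matches are counted, and bases are assumed independent and equally likely.

```python
import itertools
import random

BASES = "ACGT"

def count_occurrences(seq, pattern):
    """Count occurrences of pattern in seq, allowing overlapping matches."""
    k = len(pattern)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == pattern)

def exact_mean(seq_len, pattern):
    """Average count over every possible sequence of length seq_len.

    Enumerates 4 ** seq_len sequences, so only feasible for short lengths.
    """
    total = sum(count_occurrences("".join(s), pattern)
                for s in itertools.product(BASES, repeat=seq_len))
    return total / 4 ** seq_len

def simulated_mean(seq_len, pattern, trials=100_000, seed=0):
    """Estimate the mean count from random sequences with uniform bases."""
    rng = random.Random(seed)  # fixed seed only to make runs repeatable
    total = 0
    for _ in range(trials):
        seq = "".join(rng.choices(BASES, k=seq_len))
        total += count_occurrences(seq, pattern)
    return total / trials

# Exhaustive check on a short sequence: the formula gives 5 * 0.25 ** 4.
print(exact_mean(8, "ATTG"), 5 * 0.25 ** 4)

# Random check at the length from the question: the formula gives 17 * 0.25 ** 4.
print(simulated_mean(20, "ATTG"), 17 * 0.25 ** 4)
```

The exhaustive average also matches the formula for a self-overlapping pattern such as 'AAAA'. Overlaps change the spread of the count, but not its expected value.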
