Expected Value Of Finding A Subsequence In A Sequence Of Length L
12.5 years ago
Tim ▴ 10

Hi,

I am looking for the expected number of times a given subsequence occurs within a sequence of length L.

For example, assuming that all nucleotides are equally probable, how many times would we expect to find the pattern 'ATTG' in a sequence of length 20?

I tried looking in Biostrings and IRanges (Bioconductor), but didn't find what I was looking for.

Many thanks!

sequence statistics • 4.2k views
12.5 years ago
brentp 24k

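Assuming equal base probabilities, the probability of the exact 4-base subsequence 'ATTG' occurring at any given position is 0.25 ^ 4. In a 20-base sequence, a 4-base subsequence can start at 20 - 4 + 1 = 17 positions. By linearity of expectation (which holds even though overlapping windows are not independent), the expected number of occurrences is 0.25 ^ 4 * 17 = 0.06640625.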

You can make this generic:

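def prob(seq_len, subseq_len):
    # Returns the expected number of occurrences (not a probability),
    # assuming independent, equally likely bases.
    if subseq_len > seq_len:
        return 0
    # Number of start positions where the subsequence fits.
    places = seq_len - subseq_len + 1
    # Probability of an exact match at any one position.
    p = 0.25 ** subseq_len
    return p * places

print(prob(20, 4))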

Translating to R is up to you.
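
As a sanity check on the positions-times-probability argument, here is a minimal sketch that takes the pattern itself (so unequal base frequencies can be supplied) and compares the formula against an exhaustive count over every possible short sequence. The names `expected_count`, `count_overlapping`, `brute_force_mean` and the `base_probs` parameter are illustrative assumptions, not part of any library; bases are assumed to be independent, and the brute-force check assumes they are equally likely.

```python
from itertools import product


def expected_count(seq_len, pattern, base_probs=None):
    """Expected number of (possibly overlapping) occurrences of `pattern`
    in a random sequence of `seq_len` independent bases."""
    if base_probs is None:
        # Assumption: uniform base frequencies unless told otherwise.
        base_probs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    if len(pattern) > seq_len:
        return 0.0
    # Probability of an exact match at one fixed start position.
    p = 1.0
    for base in pattern:
        p *= base_probs[base]
    # Multiply by the number of start positions where the pattern fits.
    return p * (seq_len - len(pattern) + 1)


def count_overlapping(seq, pattern):
    """Count occurrences of `pattern` in `seq`, including overlapping ones."""
    last_start = len(seq) - len(pattern)
    return sum(seq.startswith(pattern, i) for i in range(last_start + 1))


def brute_force_mean(seq_len, pattern):
    """Average occurrence count over all 4**seq_len equally likely sequences.
    Only feasible for small seq_len."""
    total = 0
    for bases in product("ACGT", repeat=seq_len):
        total += count_overlapping("".join(bases), pattern)
    return total / 4 ** seq_len


print(expected_count(20, "ATTG"))  # 0.06640625
print(brute_force_mean(6, "ATTG"), expected_count(6, "ATTG"))
```

The exhaustive check is run at length 6 rather than 20 because enumerating 4^20 sequences is impractical; the two numbers on the second line should agree.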
