Hi,

I am trying to find the expected number of occurrences of a subsequence within a sequence of length L.

For example, assuming that all nucleotides are equally probable, how many times would we expect to find the pattern 'ATTG' in a sequence of length 20?

I tried looking in Biostrings and IRanges (Bioconductor), but didn't find what I was looking for.

Many thanks!

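Assuming equal base probabilities, the probability that the exact 4-base subsequence 'ATTG' starts at any given position is 0.25 ^ 4. In a 20-base sequence, a 4-base subsequence can start at 20 - 4 + 1 = 17 positions. Expectations add up even though the windows overlap (linearity of expectation), so the expected number of occurrences is 0.25 ^ 4 * 17 = 0.06640625.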

You can make this generic:

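```python
def prob(seq_len, subseq_len):
    # A subsequence longer than the sequence cannot occur at all.
    if subseq_len > seq_len:
        return 0
    # Number of start positions where the subsequence fits.
    places = seq_len - subseq_len + 1
    # Probability of an exact match at one position (equal base frequencies).
    p_match = 0.25 ** subseq_len
    # Despite its name, this returns the expected count, not a probability.
    return p_match * places

print(prob(20, 4))  # 0.06640625
```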
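If the bases are not equally probable, the same reasoning should still apply: the match probability at one position becomes the product of the per-base probabilities. Here is a minimal sketch of that, assuming independent positions; the name `expected_occurrences` and the `base_probs` dictionary are my own for illustration, not from any library, and `math.prod` needs Python 3.8 or later.

```python
from math import prod


def expected_occurrences(pattern, seq_len, base_probs=None):
    """Expected number of (possibly overlapping) occurrences of `pattern`
    in a random sequence of length `seq_len` with independent positions."""
    if base_probs is None:
        # Assumption: equal base frequencies, as in the question.
        base_probs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    # Number of start positions where the pattern fits.
    places = seq_len - len(pattern) + 1
    if places <= 0:
        return 0.0
    # Probability of an exact match at one given start position.
    p_match = prod(base_probs[base] for base in pattern)
    return places * p_match


print(expected_occurrences("ATTG", 20))  # 0.06640625
```

With the default frequencies this reproduces the 0.06640625 from above; pass your own `base_probs` (summing to 1) for a skewed composition.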

Translating to R is up to you.
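If you want to convince yourself that overlapping windows don't break the formula, you can brute-force a small case. This is only a sanity-check sketch (the helper names are mine): it enumerates every sequence of a given length over ACGT, counts overlapping matches, and averages them exactly with `Fraction`. It is only feasible for short sequences, since there are 4 ** seq_len of them.

```python
from fractions import Fraction
from itertools import product


def count_overlapping(seq, pattern):
    """Count occurrences of `pattern` in `seq`, allowing overlaps."""
    k = len(pattern)
    return sum(seq[i:i + k] == pattern for i in range(len(seq) - k + 1))


def brute_force_expectation(pattern, seq_len, alphabet="ACGT"):
    """Exact mean count over all equally likely sequences of length seq_len."""
    total = 0
    n_seqs = 0
    for letters in product(alphabet, repeat=seq_len):
        total += count_overlapping("".join(letters), pattern)
        n_seqs += 1
    return Fraction(total, n_seqs)


# Pattern 'AT' in sequences of length 6: 4096 sequences to enumerate.
print(brute_force_expectation("AT", 6))  # 5/16
# Formula: (6 - 2 + 1) places * 0.25 ** 2
print(Fraction(6 - 2 + 1, 4 ** 2))       # 5/16
```

The two numbers agree, and they also agree for a self-overlapping pattern such as 'AA', which is the case where people usually doubt the formula.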

