Question: Expected Value Of Finding A Subsequence In A Sequence Of Length L
1
6.6 years ago by
Tim10
Tim10 wrote:

Hi,

I am looking for the expected number of times a given subsequence occurs within a sequence of length L.

For example, assuming that all nucleotides are equally probable, how many times would we expect to find the pattern 'ATTG' in a sequence of length 20?

I tried looking in Biostrings and IRanges (Bioconductor), but didn't find what I was looking for.

Many thanks!

4
6.6 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

Assuming equal base probabilities, the probability that the exact 4-base subsequence 'ATTG' starts at any given position is 0.25 ^ 4. In a 20-base sequence, a 4-base subsequence can start at 20 - 4 + 1 = 17 positions. So the expected number of occurrences is 0.25 ^ 4 * 17 == 0.06640625.
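One way to justify multiplying by the number of positions is linearity of expectation. This is a sketch in notation that is not in the original answer: $L$ is the sequence length, $k$ the pattern length, and $X_i$ an indicator that equals 1 when the pattern starts at position $i$. It assumes independent, uniformly distributed bases.

$$
E[X] = E\left[\sum_{i=1}^{L-k+1} X_i\right] = \sum_{i=1}^{L-k+1} P(X_i = 1) = (L-k+1)\left(\frac{1}{4}\right)^{k}
$$

For $L = 20$ and $k = 4$ this gives $17 / 256 = 0.06640625$. Linearity of expectation does not require the windows to be independent, so the overlap between neighbouring windows does not change the expected count (it would matter for the variance, or for the probability of at least one occurrence).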

You can make this generic:

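```python
def expected_count(seq_len, subseq_len):
    """Expected number of occurrences of one fixed subsequence,
    assuming each of the 4 bases is equally likely at every position."""
    if subseq_len > seq_len:
        return 0.0
    # Number of start positions where the subsequence can fit.
    places = seq_len - subseq_len + 1
    # Probability of an exact match at any single start position.
    prob = 0.25 ** subseq_len
    return prob * places

print(expected_count(20, 4))  # 0.06640625
```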

Translating to R is up to you.
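If you want to sanity-check the formula, a small Monte Carlo simulation is one option. The sketch below is not part of the original answer: the function names, the number of trials, and the seed are all illustrative choices. It assumes independent, uniformly distributed bases and counts overlapping occurrences.

```python
import random


def count_occurrences(seq, pattern):
    """Count occurrences of pattern in seq, including overlapping ones."""
    k = len(pattern)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == pattern)


def simulate_mean_count(pattern="ATTG", seq_len=20, trials=100_000, seed=0):
    """Estimate the mean number of occurrences of pattern in random sequences.

    Assumption: each base is drawn independently and uniformly from ACGT.
    """
    rng = random.Random(seed)  # fixed seed only so that runs are repeatable
    total = 0
    for _ in range(trials):
        seq = "".join(rng.choices("ACGT", k=seq_len))
        total += count_occurrences(seq, pattern)
    return total / trials
```

Calling `simulate_mean_count()` should return a value close to the analytic 0.0664, with the usual sampling noise. The simulated mean depends only on the pattern length, but patterns that can overlap themselves (such as 'AAAA') have a different spread around that mean.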
