Question

Really Basic Statistics: Expected frequency?

0

8.7 years ago

jth ▴ 190

Hi everyone,

I have a question on basic statistics. I feel really bad about asking this but I'm having a brain freeze now. So, I'm going back to basics.

Purpose: Calculate expected trinucleotide frequencies specific to 5'UTR regions only. These will be used to check, via an observed-to-expected ratio, whether windows (around 20 bp long) at particular locations (such as the beginning or end) in a specific set of 5'UTRs contain different amounts of these trinucleotides.

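Calculated Data: I collected all 5'UTR sequences and counted all nucleotides. I have calculated their frequencies as follows:

$$f_A = \frac{n_A}{N}, \quad f_C = \frac{n_C}{N}, \quad f_G = \frac{n_G}{N}, \quad f_T = \frac{n_T}{N}$$

where $n_A$, $n_C$, $n_G$, $n_T$ represent the occurrence count of each nucleotide in all 5'UTR regions (these regions only!) and $N$ represents the total number of nucleotides in all 5'UTR regions.

From this I calculate the expected frequency of a trinucleotide $xyz$ by

$$E_{xyz} = f_x \, f_y \, f_z$$

where $x, y, z \in \{A, C, G, T\}$. Then, for each window of interest, I count the occurrences of each trinucleotide and calculate the observed frequency by

$$O_{xyz} = \frac{c_{xyz}}{w}$$

where $c_{xyz}$ represents the number of occurrences of trinucleotide $xyz$ in the window and $w$ represents the window size.

Lastly, the observed-to-expected ratio becomes

$$R_{xyz} = \frac{O_{xyz}}{E_{xyz}}$$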
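The calculation described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names are made up for this example, trinucleotides are counted at overlapping positions on one strand, the expected frequency assumes the three positions are independent, and the observed count is divided by the window size as in the question.

```python
from collections import Counter
from itertools import product


def nucleotide_frequencies(utr_sequences):
    """Frequency of each nucleotide (count / total), pooled over all 5'UTR sequences."""
    counts = Counter()
    for seq in utr_sequences:
        # Ignore anything that is not A, C, G or T (e.g. N).
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}


def expected_trinucleotide_frequencies(freqs):
    """Expected frequency of each trinucleotide xyz as f_x * f_y * f_z."""
    return {
        x + y + z: freqs[x] * freqs[y] * freqs[z]
        for x, y, z in product("ACGT", repeat=3)
    }


def observed_trinucleotide_frequencies(window):
    """Overlapping trinucleotide counts in one window, divided by the window size."""
    window = window.upper()
    counts = Counter(window[i:i + 3] for i in range(len(window) - 2))
    return {tri: n / len(window) for tri, n in counts.items()}


def observed_to_expected(window, expected):
    """Observed-to-expected ratio for every trinucleotide seen in the window."""
    observed = observed_trinucleotide_frequencies(window)
    return {
        tri: obs / expected[tri]
        for tri, obs in observed.items()
        if expected.get(tri, 0) > 0  # skips trinucleotides containing non-ACGT characters
    }


# Toy data, purely for illustration.
utrs = ["ACGTACGT", "AACCGGTT"]
expected = expected_trinucleotide_frequencies(nucleotide_frequencies(utrs))
ratios = observed_to_expected("ACG" * 6 + "AC", expected)
```

Trinucleotides that never occur in the window are simply absent from `ratios`; whether to report them with a ratio of zero instead is a design choice left open here.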

So, I feel like I'm getting something extremely wrong in this basic calculation and that I should include the window size in the expected frequency calculation as well. But I couldn't convince myself to do this, since I already get a "frequency" on the observed side.

Am I doing something really wrong and misinterpreting/overthinking everything now?

EDIT: I uploaded a couple of images to a more compatible website but decided it wasn't worth the effort - Ram.

genome analysis statistics • 2.6k views

0

4.6 years ago

mbramble • 0

I don't think your expected frequency is a frequency at all; it is the probability of seeing a certain trinucleotide at any random three-base site in your window or any region. To get an expected count for a window from your probability, you would need to multiply by something analogous to the number of trials in a binomial model, which in this case would be the total number of 3-base sites in a window of length w, or w-2 (ignoring strandedness). Unless I'm missing something in the notation.
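To make the distinction between a per-site probability and a per-window count concrete, here is a small hedged sketch. It assumes overlapping sites on a single strand, so a window of length w has w − 2 trinucleotide start positions; the function names are hypothetical and chosen only for this example.

```python
def expected_count(p_trinucleotide, window_size):
    """Expected occurrences in one window: per-site probability times number of sites."""
    n_sites = window_size - 2  # overlapping 3-base start positions on one strand
    return p_trinucleotide * n_sites


def count_ratio(observed_count, p_trinucleotide, window_size):
    """Observed count divided by expected count."""
    return observed_count / expected_count(p_trinucleotide, window_size)


def frequency_ratio(observed_count, p_trinucleotide, window_size):
    """Ratio as defined in the question: (count / window size) / probability."""
    return (observed_count / window_size) / p_trinucleotide


# Toy numbers: probability 1/64, a 20 bp window, 6 observed occurrences.
by_count = count_ratio(6, 1 / 64, 20)
by_frequency = frequency_ratio(6, 1 / 64, 20)
```

The two ratios differ only by the constant factor w / (w − 2), which is the same for every trinucleotide in windows of the same size, so comparisons between trinucleotides or between equally sized windows are not affected by the choice.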

Ram · Accepted Answer · 2016-02-04

2

8.7 years ago

Devon Ryan 104k

What you presented is fine; you've already used the window size in the observed frequency.

BTW, I assume you're taking strand into account here when getting the observed counts.

0

Thanks a lot! Yes, I look for a strand effect as well: I first do the same calculation for the sense strand and the anti-sense strand separately, then I sum the occurrences on both strands, divide by window_size*2, and calculate an obs-to-exp ratio from that as well.

updated 4.7 years ago by Ram 44k • written 8.7 years ago by jth ▴ 190
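The both-strand variant described in this reply might look something like the sketch below. It assumes the anti-sense strand is obtained as the reverse complement of the window and that counts are taken at overlapping positions; the helper names are illustrative only.

```python
from collections import Counter

# Translation table for complementing A/C/G/T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (characters other than ACGT are left as-is)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def trinucleotide_counts(seq):
    """Overlapping trinucleotide counts on one strand."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))


def both_strand_frequencies(window):
    """Sum counts over sense and anti-sense strands, then divide by window_size * 2."""
    combined = trinucleotide_counts(window) + trinucleotide_counts(reverse_complement(window))
    denominator = 2 * len(window)
    return {tri: n / denominator for tri, n in combined.items()}


# Toy 20 bp window, purely for illustration.
frequencies = both_strand_frequencies("ACG" * 6 + "AC")
```

Dividing each value in `frequencies` by the corresponding expected frequency then gives the combined obs-to-exp ratio mentioned in the reply.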