Hi everyone,

I have a question on basic statistics. I feel really bad about asking this but I'm having a brain freeze now. So, I'm going back to basics.

**Purpose:** Calculate expected trinucleotide frequencies specific to 5'UTR regions only. These will be used to check whether some windows (around 20bp long) at particular locations (such as the beginning or end) in a set of specific 5'UTRs contain a different amount of these trinucleotides, by calculating an observed-to-expected ratio.

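**Calculated Data**: I collected all 5'UTR sequences and counted all nucleotides. I have calculated their frequencies as follows:

$$f_A = \frac{n_A}{N}, \quad f_C = \frac{n_C}{N}, \quad f_G = \frac{n_G}{N}, \quad f_T = \frac{n_T}{N}$$

where $n_A$, $n_C$, $n_G$, $n_T$ represent the occurrence count of each nucleotide in all 5'UTR regions (these regions only!) and $N$ represents the total number of nucleotides in all 5'UTR regions.

From this I calculate the expected frequency of a trinucleotide $XYZ$ by

$$f_{exp}(XYZ) = f_X \cdot f_Y \cdot f_Z$$

where $X, Y, Z \in \{A, C, G, T\}$. Then, for each window of interest, I count the occurrences of each trinucleotide and calculate the observed frequency by

$$f_{obs}(XYZ) = \frac{c_{XYZ}}{W}$$

where $c_{XYZ}$ represents the occurrence count of trinucleotide $XYZ$ in the window and $W$ represents the window size.

Lastly, the observed-to-expected ratio becomes

$$\frac{f_{obs}(XYZ)}{f_{exp}(XYZ)}$$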

So, I feel like I'm getting something extremely wrong in this basic calculation and that I should take the window size into account in the expected frequency as well. But I couldn't convince myself to do this, since the observed part is already a "frequency" too.

Am I doing something really wrong and misinterpreting/overthinking everything now?
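For reference, here is a minimal Python sketch of the calculation described above, so it is clear what I am actually computing. The function names and the toy sequences are made up for illustration, and it assumes plain A/C/G/T strings with overlapping trinucleotide counting; it is a sketch of my current approach, not a claim that the approach is right.

```python
from collections import Counter


def base_frequencies(utr_seqs):
    """Frequency of each nucleotide pooled over all 5'UTR sequences."""
    counts = Counter()
    for seq in utr_seqs:
        # Only count unambiguous bases (assumption: anything else is skipped).
        counts.update(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}


def expected_freq(tri, base_freq):
    """Expected trinucleotide frequency: product of its three base frequencies."""
    f = 1.0
    for b in tri:
        f *= base_freq[b]
    return f


def observed_freq(tri, window):
    """Overlapping occurrences of `tri` in the window, divided by window size.

    This divides by the window size as in the post, not by the number of
    trinucleotide start positions (len(window) - 2).
    """
    window = window.upper()
    count = sum(1 for i in range(len(window) - 2) if window[i:i + 3] == tri)
    return count / len(window)


def obs_exp_ratio(tri, window, base_freq):
    """Observed-to-expected ratio for one trinucleotide in one window."""
    return observed_freq(tri, window) / expected_freq(tri, base_freq)


# Toy usage with made-up sequences (hypothetical data, not real 5'UTRs).
bf = base_frequencies(["ACGT", "AACC"])
ratio = obs_exp_ratio("AAC", "AACAAC", bf)
```

Note that `obs_exp_ratio` would fail with a division by zero if one of the bases never occurs in the pooled 5'UTRs, which should not happen with real data.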

EDIT: I uploaded a couple of images to a more compatible website but decided it wasn't worth the effort - Ram.

Thanks a lot! Yes, I look for a strand effect as well: I first do the same calculation for the sense strand and the anti-sense strand separately, then I sum the occurrences on both strands, divide by `window_size*2`, and calculate an obs-to-exp ratio from that as well.
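The combined-strand observed frequency could be sketched roughly like this in Python. The helper names are hypothetical, the anti-sense strand is assumed to be the reverse complement of the window, and only the observed part is shown.

```python
def reverse_complement(seq):
    """Reverse complement of an A/C/G/T string."""
    comp = str.maketrans("ACGT", "TGCA")
    return seq.upper().translate(comp)[::-1]


def count_tri(tri, seq):
    """Overlapping occurrences of `tri` in `seq`."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 2) if seq[i:i + 3] == tri)


def both_strand_observed_freq(tri, window):
    """Sum of sense and anti-sense counts, divided by window_size * 2."""
    sense = count_tri(tri, window)
    antisense = count_tri(tri, reverse_complement(window))
    return (sense + antisense) / (len(window) * 2)
```

For a window like `"AAAC"`, the anti-sense strand is `"GTTT"`, so `AAA` is counted once on the sense strand and `GTT` once on the anti-sense strand.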