Really Basic Statistics: Expected frequency?
8.7 years ago
jth ▴ 190

Hi everyone,

I have a question on basic statistics. I feel really bad about asking this but I'm having a brain freeze now. So, I'm going back to basics.

Purpose: Calculate expected trinucleotide frequencies specific to 5'UTR regions only. These will be used to check whether some windows (around 20 bp long) at particular locations (such as the beginning or end) in a set of specific 5'UTRs contain a different amount of these trinucleotides, by calculating an observed-to-expected ratio.

Calculated Data: I collected all 5'UTR sequences and counted all nucleotides. I have calculated their frequency as follows:

$$f_A = \frac{n_A}{N}, \quad f_C = \frac{n_C}{N}, \quad f_G = \frac{n_G}{N}, \quad f_T = \frac{n_T}{N}$$

where $n_A$, $n_C$, $n_G$, $n_T$ represent the occurrence count of each nucleotide in all 5'UTR regions (these regions only!) and $N$ represents the total number of nucleotides in all 5'UTR regions.

From this I calculate the expected frequency of a trinucleotide $XYZ$ by

$$E_{XYZ} = f_X \, f_Y \, f_Z$$

where $X, Y, Z \in \{A, C, G, T\}$. Then, for each window of interest, I count the occurrence of each trinucleotide and calculate the observed frequency by

$$O_{XYZ} = \frac{c_{XYZ}}{w}$$

where $c_{XYZ}$ represents the occurrence of trinucleotide $XYZ$ and $w$ represents the window size.

Lastly, the observed-to-expected ratio becomes

$$\frac{O_{XYZ}}{E_{XYZ}}$$

So, I feel like I'm getting something extremely wrong in this basic thing and that I should take the window size into account in the expected frequency calculation as well. But I haven't been able to convince myself to do this, since I already get a "frequency" on the observed side.

Am I doing something really wrong and misinterpreting/overthinking everything now?
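
For concreteness, the calculation described above could be sketched in Python roughly as follows. This is only a minimal illustration under some assumptions: the function names are hypothetical, occurrences are counted with overlap, characters other than A/C/G/T are skipped when counting bases, and the expected frequency treats the three positions as independent, as in the description.

```python
from collections import Counter


def base_frequencies(utr_seqs):
    """Frequency of each nucleotide over all 5'UTR sequences (non-ACGT ignored)."""
    counts = Counter()
    for seq in utr_seqs:
        counts.update(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}


def expected_frequency(tri, base_freq):
    """Product of the three single-base frequencies (independence assumption)."""
    p = 1.0
    for b in tri.upper():
        p *= base_freq[b]
    return p


def observed_frequency(window, tri):
    """Overlapping occurrences of tri in the window, divided by the window size."""
    window, tri = window.upper(), tri.upper()
    count = sum(1 for i in range(len(window) - 2) if window[i:i + 3] == tri)
    return count / len(window)


def obs_to_exp(window, tri, base_freq):
    """Observed-to-expected ratio for one trinucleotide in one window."""
    return observed_frequency(window, tri) / expected_frequency(tri, base_freq)
```

With base frequencies computed once from all 5'UTRs, `obs_to_exp(window, "AAC", base_freq)` would then be evaluated per window and per trinucleotide.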

EDIT: I uploaded a couple of images to a more compatible website but decided it wasn't worth the effort - Ram.

genome analysis statistics • 2.6k views
8.7 years ago

What you presented is fine; you've already used the window size in the observed frequency.

BTW, I assume you're taking strand into account here when getting the observed counts.


Thanks a lot! Yes, I check for a strand effect as well: I first do the same calculation for the sense and antisense strands separately, then I sum the occurrences on both strands, divide by window_size*2, and calculate an obs-to-exp ratio from that as well.
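
The two-strand version described in this reply might look something like the sketch below. It assumes the antisense strand is obtained as the reverse complement of the window and that only A/C/G/T occur; the helper names are hypothetical.

```python
# Translation table for complementing A/C/G/T (other characters are left as-is).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_overlapping(seq, tri):
    """Number of (possibly overlapping) occurrences of tri in seq."""
    return sum(1 for i in range(len(seq) - 2) if seq[i:i + 3] == tri)


def both_strand_observed_frequency(window, tri):
    """Sum of sense and antisense occurrences divided by window_size * 2."""
    window, tri = window.upper(), tri.upper()
    sense = count_overlapping(window, tri)
    antisense = count_overlapping(reverse_complement(window), tri)
    return (sense + antisense) / (2 * len(window))
```

Dividing this value by the same expected frequency as before gives the combined obs-to-exp ratio.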

4.6 years ago
mbramble • 0

I don't think your expected frequency is a frequency at all; it is the probability of seeing a certain trinucleotide at any random three-base site in your window or any region. To get an expected count for a window from your probability, you would need to multiply by something analogous to a binomial coefficient, which in this case would be the total number of 3-base sites in a window of length w, or w-2 (ignoring strandedness), since a trinucleotide can start at positions 1 through w-2. Unless I'm missing something in the notation.
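
As a rough illustration of this point: counting directly, a window of length w has w - 2 overlapping three-base start positions, so the per-site probability can be turned into an expected count for the window as sketched below. The function name is hypothetical, and strandedness is ignored as in the answer.

```python
def expected_count(tri_probability, window_size):
    """Expected occurrences of a trinucleotide in one window.

    A window of length w has w - 2 overlapping three-base start positions,
    so the per-site probability is multiplied by that number of sites.
    """
    n_sites = max(window_size - 2, 0)
    return tri_probability * n_sites
```

For a 20 bp window and uniform base frequencies (probability 1/64 per site), this gives 18/64, i.e. about 0.28 expected occurrences, which can be compared with the raw observed count rather than with the count divided by the window size.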
