A biologist plans to sequence 50 nucleotides upstream of the transcription initiation sites of 10 distinct genes activated under the same experimental conditions. The biologist wants to use these data to test the hypothesis that the oligomer CCAGG is contributing to promoter recognition. As the consultant bioinformatician, you are tasked to come up with a plan for statistical evaluation of experimental data. Consider two cases: 1) You have no a priori knowledge of expected nucleotide frequencies in these upstream regions; 2) You can assume that these particular upstream regions have expected nucleotide frequencies similar to those in known promoter regions.
Continuing the previous example, after looking at the experimental data it seems to you that there is no evidence for a role of CCAGG, but the oligomer TTCAA really sticks out to you. What are you going to do with this hunch, and how are you going to advise your biology colleague?
Looks like a school assignment. We're not going to do it for you. Show us what your reasoning is and where you're stuck and someone may help you.
Solution:

1. If we have no knowledge of the expected nucleotide frequencies beforehand, we can still show experimentally that the oligomer CCAGG contributes to promoter recognition, and that recognition does not take place in its absence. The "only if" part is harder to prove, i.e. that recognition takes place only in the presence of the oligomer. But the biologists can address this by showing that they can abrogate the process by experimentally disrupting the oligomer.

2. First, we state the null hypothesis, H0: "CCAGG is contributing to the promoter recognition". The alternative hypothesis would be H1: "CCAGG doesn't contribute to the promoter recognition." The evidence in the trial is our data. We can assume that these particular upstream regions have expected nucleotide frequencies similar to those in known promoter regions. We shall calculate the p-value to weigh the strength of the hypothesis.

[Formula]

where
p = p-value
Pr[s] = probability of each sequence s in the sequence space \omega
Pr[S] = probability of the sequence/oligomer we are looking at (assumed to be the same as in the known promoter regions)
1{.} = an indicator function that returns 1 when its argument is true, otherwise 0
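To make the p-value idea concrete, here is a rough Python sketch of how it could be estimated by simulation. It assumes a background model in which CCAGG has no special role and appears only by chance, with nucleotides drawn independently at fixed frequencies. The test statistic (the number of sequences, out of 10, that contain the oligomer at least once) and the function names are my own choices for illustration, not something given in the problem. For case 1 one could plug in uniform frequencies, for case 2 the frequencies from known promoter regions; the non-uniform numbers below are placeholders, not real data.

```python
import random

BASES = "ACGT"


def count_sequences_with_motif(seqs, motif):
    """Return how many sequences contain the motif at least once."""
    return sum(motif in s for s in seqs)


def monte_carlo_p_value(observed, motif, freqs, n_seqs=10, length=50,
                        n_sim=10000, seed=0):
    """Estimate P(statistic >= observed) under an i.i.d. background model.

    observed: number of real sequences containing the motif
    freqs:    dict of background nucleotide frequencies (an assumption)
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    weights = [freqs[b] for b in BASES]
    hits = 0
    for _ in range(n_sim):
        # Simulate one experiment: n_seqs random upstream regions.
        sim = ["".join(rng.choices(BASES, weights=weights, k=length))
               for _ in range(n_seqs)]
        if count_sequences_with_motif(sim, motif) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_sim + 1)


# Case 1: no prior knowledge, so assume uniform frequencies.
uniform = {b: 0.25 for b in BASES}
# Case 2: placeholder values standing in for known promoter frequencies.
promoter_like = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

if __name__ == "__main__":
    # Hypothetical observation: 3 of the 10 sequences contain CCAGG.
    print(monte_carlo_p_value(3, "CCAGG", uniform))
    print(monte_carlo_p_value(3, "CCAGG", promoter_like))
```

A small estimate would mean the observed count is hard to explain by chance under the assumed background. The same function could be reused for TTCAA, but since that oligomer was picked after looking at the data, the result would need a correction for having implicitly scanned all 4^5 = 1024 pentamers, or better, a fresh data set.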
I'm not sure if I'm on the right track.