Question

Statistical modelling - how to find out the significance of a sequence feature

6.4 years ago

ransinha • 0

A biologist plans to sequence 50 nucleotides upstream of the transcription initiation sites of 10 distinct genes activated under the same experimental conditions. The biologist wants to use these data to test the hypothesis that the oligomer CCAGG is contributing to promoter recognition. As the consultant bioinformatician, you are tasked to come up with a plan for statistical evaluation of experimental data. Consider two cases: 1) You have no a priori knowledge of expected nucleotide frequencies in these upstream regions; 2) You can assume that these particular upstream regions have expected nucleotide frequencies similar to those in known promoter regions.

Continuing the previous example, after looking at the experimental data it seems to you that there is no evidence for a role of CCAGG, but the oligomer TTCAA really sticks out to you. What are you going to do with this hunch, and how are you going to advise your biology colleague?

sequencing statistical modelling • 963 views

Looks like a school assignment. We're not going to do it for you. Show us what your reasoning is and where you're stuck and someone may help you.

6.4 years ago by Jean-Karim Heriche • 27k

Solution:

1. If we have no prior knowledge of the expected nucleotide frequencies, the role of CCAGG can still be tested experimentally: show that promoter recognition takes place when the oligomer is present and does not take place when it is absent. The "only if" part, i.e. that recognition takes place only in the presence of the oligomer, is harder to prove, but the biologists can address it by showing that experimentally disrupting the oligomer abrogates the process.

2. First, we state the null hypothesis, H0: "CCAGG does not contribute to promoter recognition", i.e. it occurs no more often than the expected nucleotide frequencies predict. The alternative hypothesis is H1: "CCAGG contributes to promoter recognition." The evidence is our data. We assume that these particular upstream regions have expected nucleotide frequencies similar to those in known promoter regions, and we calculate the p-value to weigh the strength of the evidence against H0.

[Formula]

where p = p-value; Pr[s] = probability of each sequence s in the sequence space \omega; Pr[S] = probability of the sequence/oligomer we are looking at (assumed to be the same as in the known promoter regions); 1{.} = an indicator function that returns 1 when the argument is true, otherwise 0.
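Going by the symbols defined above, the p-value presumably takes the standard exact-test form: the total probability of all sequences in \omega that are no more probable than the observed one. This is a reconstruction from those definitions, and the direction of the inequality in particular is an assumption, not something stated in the post:

```latex
p = \sum_{s \in \omega} \Pr[s] \cdot \mathbf{1}\{\Pr[s] \le \Pr[S]\}
```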

• A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.
• A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.
• p-values very close to the cutoff (0.05) are considered to be marginal (could go either way). Always report the p-value so your readers can draw their own conclusions.
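For case 2, one way to get such a p-value in practice is sketched below. This is a minimal sketch, not the method from the post: it assumes an i.i.d. background model built from the supplied nucleotide frequencies and treats every start position as an independent trial (a binomial approximation that ignores dependence between neighbouring positions). The function names and the uniform frequencies are hypothetical, chosen only for illustration.

```python
from math import comb


def motif_probability(motif, freqs):
    """Probability of the motif at one fixed position under an i.i.d. background."""
    p = 1.0
    for base in motif:
        p *= freqs[base]
    return p


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); intended for modest n (a few hundred)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def motif_pvalue(sequences, motif, freqs):
    """Return (observed count, one-sided p-value) for motif occurrences.

    Assumption: each start position is an independent Bernoulli trial,
    which is only an approximation for real sequences.
    """
    m = len(motif)
    observed = sum(
        seq[i:i + m] == motif
        for seq in sequences
        for i in range(len(seq) - m + 1)
    )
    n_positions = sum(max(len(seq) - m + 1, 0) for seq in sequences)
    p_site = motif_probability(motif, freqs)
    return observed, binomial_tail(observed, n_positions, p_site)


# Hypothetical background: uniform frequencies, used here only as a placeholder.
# For case 2, substitute frequencies estimated from known promoter regions.
uniform = {base: 0.25 for base in "ACGT"}
toy_sequences = ["ACCAGGT", "TTTTTTT"]  # toy data, not real upstream regions
count, p_value = motif_pvalue(toy_sequences, "CCAGG", uniform)
```

With the real data this would be called on the ten 50-nt upstream sequences. The oligomer has to be fixed before looking at the data for the resulting p-value to mean what the bullets above say it means.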