So this question came up in a lecture on "Introduction to transcription". Our professor asked what the probability is of having the same starting base on the forward and the reverse strand. In other words, if I have
5'_______3'
ATTGCCATAT
TAACGGTATA
3'_______5'
What are the odds of that happening (and likewise for the other bases T, G, C)?
My answer is as follows:
P(A) = P(T) = P(G) = P(C) = 1/4
So, P(A on 5' and A on 3') = P(A) · P(A) = 1/16
and P(G on 5' and G on 3') = P(G) · P(G) = 1/16, and so on for the other bases. Adding these up over all four bases gives 4 × 1/16 = 1/4. Am I correct?
What puzzles me is this: the probability of having the same base on the reverse strand = 1/4 = the probability of having any one base.
Is there any significance to this?
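The 1/4 figure can be checked by enumerating every pair of starting bases. This is only a minimal sketch, assuming the two starting bases are drawn independently and each base has probability 1/4; the function name `same_start_probability` is made up for illustration.

```python
from fractions import Fraction
from itertools import product

BASES = "ATGC"

# Assumption: every base is equally likely (the naive model from the question).
p = {base: Fraction(1, 4) for base in BASES}


def same_start_probability(probs):
    """Probability that two independently drawn starting bases are identical."""
    return sum(
        probs[x] * probs[y]
        for x, y in product(probs, repeat=2)
        if x == y
    )


print(same_start_probability(p))  # 1/4
```

Under this model the sum is just P(A)² + P(T)² + P(G)² + P(C)², which equals 1/4 only because all four probabilities are equal. That is why it coincides with the probability of any single base.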
This model assumes that the probability of each base is 1/4. From a biological standpoint, transcription start sites are highly clustered around binding motifs, so the nucleotides there are far from randomly distributed. To get an accurate estimate, one probably needs to correct for factors like GC content. So I would say the naive probability you propose will not be accurate. You might have a look at papers about motif enrichment and how they model nucleotide occurrence.
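To give a rough idea of how GC content alone would shift the number, here is a minimal sketch. It assumes P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2, and it still treats the two starting bases as independent, which real promoter sequences will not satisfy. The function name `same_start_probability_gc` is hypothetical.

```python
def same_start_probability_gc(gc):
    """Probability of identical starting bases for a given GC fraction.

    Assumptions (illustrative only): G and C each have probability gc/2,
    A and T each have probability (1 - gc)/2, and the two starting bases
    are independent of each other.
    """
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be a fraction between 0 and 1")
    p_gc = gc / 2          # probability of G (and of C)
    p_at = (1 - gc) / 2    # probability of A (and of T)
    # Sum of squared base probabilities: A, T, G, C.
    return 2 * p_at ** 2 + 2 * p_gc ** 2


for gc in (0.3, 0.4, 0.5, 0.6):
    print(gc, same_start_probability_gc(gc))
```

With gc = 0.5 this reduces to the naive 1/4. Any skew in GC content pushes the value above 1/4, and motif-driven positional biases would change it further.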
Thanks for replying. Will look into factors like GC content.