Hi all,
I’m trying to generate DNA sequences using a first-order Markov model in GenRGenS. For this, I have to input the frequencies of all possible dinucleotides. I have a list of specific values of GC content and CpG fraction (number of CpG dinucleotides divided by the length of the DNA fragment) that I need to use when generating the DNA sequences. I’m having trouble figuring out how to calculate the frequencies of the dinucleotides from the given GC content and CpG fraction, especially for the dinucleotides CC, CA and CT.
This is what I have tried so far:
k = G+C content
l = CpG fraction
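P(XY|X) = ? (probability that the second nucleotide is Y given that the first is X; this is what I need for every dinucleotide XY)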
P({C, G}) = k
P(CG) = l
P(AA|A) = (1-k)/2
P(AC|A) = k/2
P(AG|A) = k/2
P(AT|A) = (1-k)/2
P(GA|G) = (1-k)/2
P(GC|G) = k/2
P(GG|G) = k/2
P(GT|G) = (1-k)/2
P(TA|T) = (1-k)/2
P(TC|T) = k/2
P(TG|T) = k/2
P(TT|T) = (1-k)/2
consider
P(CG) = P(CG|C)P(C) = l
P({CG, CC}|C) = k
therefore
P(CG|C) = l / P(C) = l / (k/2) = 2l/k (taking P(C) = P(G) = k/2)
P(CC|C) = k - 2l/k
P(CA|C) = (1-k)/2
P(CT|C) = (1-k)/2
However, this does not work: P(CC|C) = k - 2l/k is negative whenever 2l > k^2, which is the case for the values I’m working with.
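To make the failure concrete, here is a minimal Python sketch that builds the transition table exactly as in the attempt above. The function name attempted_transitions and the values k = 0.3, l = 0.05 are hypothetical, chosen only so that 2l/k exceeds k. It is a reproduction of the problem, not a fix.

```python
def attempted_transitions(k, l):
    """Transition probabilities P(XY|X) built exactly as in the attempt above.

    k: G+C content, l: CpG fraction. Returns {first: {second: probability}}.
    """
    at = (1 - k) / 2  # probability of moving to A, and of moving to T
    gc = k / 2        # probability of moving to C, and of moving to G

    # Rows for A, G and T all use the same GC-content-only probabilities.
    rows = {first: {"A": at, "C": gc, "G": gc, "T": at} for first in "AGT"}

    # Row for C: P(CG|C) = l / P(C), with P(C) taken to be k/2.
    p_cg = l / (k / 2)
    rows["C"] = {"A": at, "C": k - p_cg, "G": p_cg, "T": at}
    return rows


# Hypothetical values for illustration only (not real data).
k, l = 0.3, 0.05
rows = attempted_transitions(k, l)

for first, row in rows.items():
    probs = {first + second: round(p, 4) for second, p in row.items()}
    print(first, probs, "row sum =", round(sum(row.values()), 4))

print("P(CC|C) is negative:", rows["C"]["C"] < 0)
```

Every row still sums to 1, so the table looks consistent at first glance, but the C row contains a negative entry as soon as 2l > k^2.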
Thank you,
Veronica