I would like to build a statistical model based on conditional probability for a bacterial species, and for a bacterial genus, using k-mers. Take the Staphylococcus genus as an example: I have the raw k-mer counts (including the reverse-complement counts) for all the Staphylococcus bacteria from NCBI, for 3 <= k <= 30.
Suppose I want a conditional probability model for, say, Staph. aureus strain X (Sp) and a k-mer of interest, ACTGC (k). Using Bayes' theorem:
P(Sp | k) = P(k | Sp) * P(Sp) / P(k)
P(k | Sp): I have calculated this from the raw counts for S. aureus, as P(k | Sp) = count of ACTGC / total count of all k-mers for S. aureus, where k = 5.
P(Sp): I assume this value is 1? I am only dealing with S. aureus strain X, so is the probability of S. aureus strain X equal to 1?
P(k) = count of k-mer ACTGC / total number of possible k-mers. In a sequence of length N there are N-k+1 k-mer positions, so P(k) = count of ACTGC / (N-k+1).
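To make the count-based quantities above concrete, here is a minimal sketch of how they could be computed for a single sequence. It assumes an unambiguous ACGT alphabet and counts both strands; `toy_genome` is a made-up sequence, and the function names are only illustrative, not taken from any library.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an ACGT-only sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every k-mer on the forward and reverse-complement strands."""
    counts = Counter()
    for strand in (seq, reverse_complement(seq)):
        # A strand of length N has N - k + 1 overlapping windows of width k.
        for i in range(len(strand) - k + 1):
            counts[strand[i:i + k]] += 1
    return counts


def kmer_frequency(counts: Counter, kmer: str) -> float:
    """Relative frequency of one k-mer among all counted k-mers."""
    total = sum(counts.values())
    return counts[kmer] / total if total else 0.0


toy_genome = "ACTGCACTGCA"  # made-up sequence, N = 11
k = 5
counts = kmer_counts(toy_genome, k)
total = sum(counts.values())
p_k_given_sp = kmer_frequency(counts, "ACTGC")

# With both strands counted, the denominator is 2 * (N - k + 1).
windows_per_strand = len(toy_genome) - k + 1
```

For one sequence with no ambiguous bases, the total k-mer count per strand is exactly N-k+1, so count / (N-k+1) and the observed single-strand frequency are the same number; counting both strands doubles the denominator.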
Am I on the right track, or should I just use P(A|B) = P(A and B) / P(B), where A = S. aureus strain X and B = the k-mer of interest? Even in the latter case, should P(B) be count of k-mer / (N-k+1), or the observed frequency from the raw k-mer counts?
When should I use Bayes' theorem versus P(A|B) = P(A and B) / P(B)?
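For reference, the two forms are algebraically the same, because P(A and B) = P(B | A) * P(A). The sketch below checks this numerically on invented counts for two hypothetical strains; the numbers are not real data, and taking the prior P(Sp) as each strain's share of the pooled k-mers is an assumption made only for this illustration.

```python
# Hypothetical pooled 5-mer counts for two made-up strains (not real data).
counts = {
    "strain_X": {"ACTGC": 30, "total": 1000},
    "strain_Y": {"ACTGC": 10, "total": 2000},
}
kmer = "ACTGC"

grand_total = sum(c["total"] for c in counts.values())
kmer_total = sum(c[kmer] for c in counts.values())

# Ingredients for Bayes' theorem.
p_k_given_x = counts["strain_X"][kmer] / counts["strain_X"]["total"]
# ASSUMPTION: prior = strain X's share of all pooled k-mers.
p_x = counts["strain_X"]["total"] / grand_total
p_k = kmer_total / grand_total

posterior_bayes = p_k_given_x * p_x / p_k

# Direct definition: P(X | k) = P(X and k) / P(k).
p_x_and_k = counts["strain_X"][kmer] / grand_total
posterior_direct = p_x_and_k / p_k
```

Both routes give the same posterior here because they are built from the same table of counts; they would differ only if the prior were chosen independently of the counts.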
I would really appreciate some help with how to create a conditional probability model for a bacterial species (and eventually a probability model for a bacterial genus).