I would like to create a statistical model using conditional probability for a bacterial species, and for a bacterial genus, using k-mers. Take the Staphylococcus genus as an example: I have the raw k-mer counts (including the reverse complement counts) for all the Staphylococcus bacteria from NCBI, for 3 <= k <= 30.

Suppose I want to create a conditional probability model, for example for Staph aureus strain X (Sp) with k-mer of interest ACTGC (k), using Bayes' theorem:

P(Sp | k) = P(k | Sp) * P(Sp) / P(k)

P(k | Sp): I have calculated this from the raw counts for S. aureus as total count of ACTGC / total count of all k-mers for S. aureus, where k = 5.
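As a rough illustration of this calculation, here is a minimal sketch that counts k-mers on both strands and divides one k-mer's count by the total. The function names and the toy sequence are made up for the example; a real genome would come from a FASTA file and may contain ambiguous bases that this sketch does not handle.

```python
from collections import Counter

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))

def kmer_counts(seq, k):
    """Count every k-mer on the forward strand and on the reverse complement."""
    counts = Counter()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - k + 1):
            counts[strand[i:i + k]] += 1
    return counts

def kmer_probability(counts, kmer):
    """Estimate P(k | Sp) as count(kmer) / total count of all k-mers."""
    total = sum(counts.values())
    return counts[kmer] / total if total else 0.0

toy_genome = "ACTGCACTGCGT"  # hypothetical stand-in for a real genome
counts = kmer_counts(toy_genome, 5)
p_k_given_sp = kmer_probability(counts, "ACTGC")
print(p_k_given_sp)
```

With both strands counted, the denominator is 2 * (N - k + 1) rather than N - k + 1, which matters if the same convention is not used for every term.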

P(Sp) - I assume this value = 1? I am only dealing with S. aureus strain X, so is the probability of S. aureus strain X = 1?

P(k) = count of k-mer ACTGC / total possible k-mers (In a sequence of length N there are N-k+1 possible k-mers) so P(k) = count of ACTGC / (N-k+1)

Am I on the right track, or should I just use P(A|B) = P(A and B) / P(B), where A = S. aureus strain X and B = k-mer of interest? Even in the latter case, should P(B) be count of k-mer / (N-k+1), or the observed frequency from the raw k-mer counts?

When should I use Bayes' theorem vs. P(A|B) = P(A and B) / P(B)?

I would really appreciate some help with how to create a conditional probability model for a bacterial species (and eventually a probability model for a bacterial genus).

Thank you for your answer - let me clarify what I want to do with more detail.

Let's say I have 50 sequences - 40 are Staphylococcus aureus and 10 are not (they are in the Staph genus, however). So my data set consists purely of sequences from the Staph genus.

What I want to do is this: given a sequence "S" (e.g. 400 base pairs randomly chosen from a Staphylococcus aureus genome), calculate the conditional probability that the unknown sequence is Staphylococcus aureus using k-mers. Let's just work with k = 3.

First I create a collection of k-mers of length 3 from the sequence "S". I have already created k-mer profiles for each of the Staph sequences for k = 3. I then iterate through the collection of k-mers from "S" and check whether each k-mer exists in the k-mer profile of each Staphylococcus aureus sequence, and do the same for the non-Staphylococcus aureus genomes.
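The presence/absence step described above can be sketched roughly as follows. This is only an illustration: the helper names, the toy query, and the four tiny "genomes" are invented, and each profile is assumed to be a plain set of the k-mers seen in that genome.

```python
def kmers_of(seq, k):
    """Return the set of distinct k-mers in seq (presence only, no counts)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def presence_fraction(kmer, profiles):
    """Fraction of genome profiles that contain kmer at least once."""
    return sum(kmer in profile for profile in profiles) / len(profiles)

# Hypothetical toy data: the query yields the k-mers GTT, TTG and TGG.
query = "GTTGG"
toy_genomes = ["GTTGGA", "AGTTGC", "CCTGGA", "GTTAAA"]
profiles = [kmers_of(genome, 3) for genome in toy_genomes]

fractions = {
    kmer: presence_fraction(kmer, profiles)
    for kmer in sorted(kmers_of(query, 3))
}
print(fractions)
```

Running the same function once over the S. aureus profiles and once over the non-S. aureus profiles gives the per-class fractions used in the Bayes calculation.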

For the sake of example, let's say my sequence "S" consists of S = { GTT, TTG, TGG }

Of the 40 Staphylococcus aureus sequences => 30/40 have GTT, 20/40 have TTG, 10/40 have TGG

Of the 10 non-Staphylococcus aureus genomes => 2/10 have GTT, 5/10 have TTG, 10/10 have TGG

Naive Bayes => P(Sa | S) = P(S | Sa) * P(Sa) / P(S), where P(Sa) is the probability of S. aureus and P(S) is the probability of the sequence of k-mers.

With priors P(Sa) = 40/50 = 0.8 and P(non-Sa) = 10/50 = 0.2:

P(Sa | S) = (0.75 * 0.5 * 0.25 * 0.8) / ((0.75 * 0.5 * 0.25 * 0.8) + (0.2 * 0.5 * 1.0 * 0.2))
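To check the arithmetic, the same calculation can be written as a short sketch. The function and class labels are hypothetical; the numbers are the per-class fractions and the 40/50 and 10/50 priors from the example above.

```python
from math import prod

def naive_bayes_posterior(likelihoods_by_class, priors):
    """Posterior per class: prior * product of likelihoods, normalised."""
    joint = {c: priors[c] * prod(likelihoods_by_class[c]) for c in priors}
    evidence = sum(joint.values())  # this is P(S), the total probability
    return {c: joint[c] / evidence for c in joint}

posterior = naive_bayes_posterior(
    {"Sa": [0.75, 0.5, 0.25], "non_Sa": [0.2, 0.5, 1.0]},
    {"Sa": 0.8, "non_Sa": 0.2},
)
print(round(posterior["Sa"], 3))  # 0.789
```

The numerator is 0.075 and the denominator is 0.075 + 0.02 = 0.095, so the posterior for S. aureus comes out at roughly 0.79.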

So this does not take into account the frequency of the k-mer in the k-mer profiles for the genomic sequences - it just counts whether the k-mer exists in the Staph profiles. That is to say, in one Staph aureus sequence CGT may occur 45 times, but I just count the occurrence of the k-mer in that sequence as 1. In effect there is no weighting for the frequency of a particular k-mer in a genome.

What I want to know is: have I applied Bayes' theorem correctly with the data set I have?

I am actually using logs rather than simple frequencies, so in effect I am adding the logs of the frequencies rather than multiplying the frequencies, to avoid underflow.

log P(Sa|S) = log P(kmer1|Sa) + log P(kmer2|Sa) + ... + log P(kmer_n|Sa) + log P(Sa) - log P(S)

where P(S) is the total probability (S. aureus plus non-S. aureus), with each class's term again computed by adding logs.
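One detail worth spelling out: the per-class terms can be summed in log space, but log P(S) is the log of a sum, so it is usually computed with the log-sum-exp trick rather than by adding logs. A minimal sketch follows, with hypothetical function names and the toy numbers from the example; it assumes no probability is exactly zero (log(0) would raise an error, so some smoothing would be needed in practice).

```python
from math import exp, log

def log_sum_exp(values):
    """Compute log(sum(exp(v))) stably by factoring out the largest term."""
    m = max(values)
    return m + log(sum(exp(v - m) for v in values))

def log_posterior(likelihoods_by_class, priors):
    """Log posterior per class; assumes every probability is > 0."""
    log_joint = {
        c: log(priors[c]) + sum(log(p) for p in likelihoods_by_class[c])
        for c in priors
    }
    log_evidence = log_sum_exp(list(log_joint.values()))  # log P(S)
    return {c: lj - log_evidence for c, lj in log_joint.items()}

log_post = log_posterior(
    {"Sa": [0.75, 0.5, 0.25], "non_Sa": [0.2, 0.5, 1.0]},
    {"Sa": 0.8, "non_Sa": 0.2},
)
print(exp(log_post["Sa"]))
```

Exponentiating the result should reproduce the value from the direct calculation, which is a handy check that the log-space version is wired up correctly.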

I would really appreciate any input on how I have applied Bayes' theorem to this problem. Should I be taking into account the frequency with which the k-mer occurs in each of the genomic profiles I created for the Staph organisms?

Thank you very much if you got to the end of this :)

This is actually a pretty interesting question (at least to us non-statisticians). I have a feeling you'll get a speedier reply on Cross Validated. Just to not freak them out, you might use a title that omits mention of bacteria :)