Suppose we have two random sequences A and B of L nucleotides each. What is the probability of observing a common substring (k-mer) of at least k nucleotides between those two sequences? Here is what I have so far:

    # Parameters
    k = 3   # a common substring must be at least 3 nucleotides long
    L = 12  # length of each of the two random sequences, A and B

    prob_kmer = 1 / 4**k                # 1/64, probability that a random k-mer equals a particular sequence
    N_prob_kmer = 1 - prob_kmer         # 63/64, probability that a random k-mer does NOT equal a particular sequence
    nb_kmer = L - k + 1                 # 10, number of k-mers in a sequence of length 12
    nb_comparison = nb_kmer**2          # 100, number of k-mer comparisons between the two sequences
    P = 1 - N_prob_kmer**nb_comparison  # ~0.79 (about 80%), probability that at least one k-mer of A matches a k-mer of B

Is this correct? I'm concerned that consecutive k-mers are not independent (they overlap, so they share nucleotides), which means the comparisons are not independent either. But I have no idea how to take that into account in my calculation.
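One way to see how much the overlap matters would be to compare the formula above against a brute-force simulation. The following is only a minimal sketch, assuming uniform, independent nucleotides; the function names (`share_kmer`, `simulate`, `approx`), the number of trials and the seed are illustrative choices, not part of the calculation above. It relies on the fact that two sequences share a substring of at least k nucleotides exactly when they share at least one k-mer.

```python
import random

def share_kmer(a, b, k):
    """Return True if a and b have at least one k-mer in common."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[j:j + k] in kmers_a for j in range(len(b) - k + 1))

def simulate(L=12, k=3, trials=100_000, seed=0):
    """Monte Carlo estimate of P(two random sequences share a k-mer).

    Assumes each nucleotide is drawn uniformly and independently.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = "".join(rng.choices("ACGT", k=L))
        b = "".join(rng.choices("ACGT", k=L))
        hits += share_kmer(a, b, k)
    return hits / trials

def approx(L=12, k=3):
    """Formula from the question, treating all comparisons as independent."""
    nb_kmer = L - k + 1
    return 1 - (1 - 4 ** -k) ** (nb_kmer ** 2)

if __name__ == "__main__":
    print("independence approximation:", approx())
    print("simulated estimate:        ", simulate())
```

Any gap between the two printed numbers would give a rough idea of the error introduced by ignoring the overlap between k-mers, at least for these values of L and k.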