Probability of finding a common sub-sequence of at least 'k' nucleotides between two sequences.
2
1
Entering edit mode
6.9 years ago

Suppose we have two random sequences A and B of L nucleotide each. What is the probability to observe a common substring (kmer) of at least k nucleotides between those two strings ? What I have got so far :

#param :
k = 3  #common substring must be of at least 3 nucleotides
L = length(A) = length(B) = 12  #the two random sequences are 12 nt long

prob_kmer = 1/(4**k)  # 1/64, the probability to observe a particular sequence in a random kmer
N_prob_kmer = 1 - prob_kmer  # 63/64, the probability to NOT observe a particular sequence in a random kmer
nb_kmer = L - k + 1 # 10, the number of possible kmer in a sequence of length 12
nb_comparison = nb_kmer**2 # 100, the number of kmer comparison between the two sequences of length 12

P = 1 - ((N_prob_kmer)**(nb_comparison)) # 80%, probability to find at least one of the kmer of sequence A matching a kmer of sequence B.


Is this correct ? I'm concerned that subsequent k-mer do not have independent sequences (they share some nucleotide since they are overlapping) so the probabilities are not independent either... But I have no idea how to take that into account in my calculation.

Thanks.

probabilities k-mer substring motif probability • 6.0k views
5
Entering edit mode

Gamblers have a similar problem - for example in the case of finding the odds of getting heads n times in a row in m tries of flipping a coin. This problem with the coins was solved by de Moivre in 1738 (for details see the post at math.stackexchange.com). I implemented this in Python (might be helpful) [but I'm not sure this is right]:

N = 5    # Length of k-mer
M = 10   # Lenghth of seuqneces A, B
p = 1/4 # Probability of nucleotide match
q = 1-p  # Probability of nucleotide mismatch

P = [None for i in range(0, M+1)]
for i in range(0, N):
P[i] = 0
P[N] = p**N

for i in range(N,M):
P[i +1] = P[i] + (1-P[i-N])*q* (p**N)

print(P[M])

0
Entering edit mode

Thanks for your answer in python, its easier for me that way. I need to think through this a bit more, but are you sure that this is correct ?

Although I don't fully understand the formula yet, I feel like this is the probability to find a specific k-mer of sequence A in sequence B, which is different than the common substring problem. For instance, what would happen with that setup ?

N=5 # length of kmer
Ma=5 #length of sequence A
Mb=10 #length of sequence B

0
Entering edit mode

The longer I think about your problem, the more doubts I have concerning the answer I gave you. I need to think this through too :) I changed my answer to a comment. Please let me know if you find the solution (the problem is very interesting).

1
Entering edit mode

Sure. I just found that this problem has a name (Common Substrings in Random Strings) and a with a quick google search, that paper. Didn't read it yet, will update.

5
Entering edit mode
6.9 years ago

I would approach the problem like this:

1) Estimate number of unique kmers in each sequence.

2) Calculate probability of the sets intersecting.

I'm sure there's a closed-form solution for each, but I don't know them off the top of my head, so here's how I'd do it iteratively:

public class ProbShared {

public static void main(String args[]){
int k=Integer.parseInt(args[0]);
int len1=Integer.parseInt(args[1]);
int len2=Integer.parseInt(args[2]);

System.out.println("Cardinality 1: "+cardinality(k, len1));
System.out.println("Cardinality 2: "+cardinality(k, len2));
System.out.println("Probability:   "+probIntersect(k, len1, len2));

}

static int cardinality(int k, int seqLength){
double space=Math.pow(4, k);
int kmers=seqLength-k+1;
double unique=0;
for(int i=0; i<kmers; i++){
double prob=(space-unique)/space;
unique+=prob;
}
return (int)Math.round(unique);
}

static double probIntersect(int k, int len1, int len2){
int card1=cardinality(k, len1);
int card2=cardinality(k, len2);
double space=Math.pow(4, k);
int kmers=len1-k+1;
double cumulativeProbUnshared=1;
for(int i=0; i<card1; i++){
double probShared=card2/space;
double probUnshared=1-probShared;
space-=probUnshared;
cumulativeProbUnshared*=probUnshared;
}
return 1-cumulativeProbUnshared;
}

}


This does not handle the dangling probability after rounding in the cardinality function, but it could be adjusted to do so. Sample output:

C:\temp>java fun.ProbShared 3 10 17
Cardinality 1: 8
Cardinality 2: 13
Probability:   0.8521277771213935


Edit: As explained below, this does NOT correctly solve the problem of kmers generated from strings (which all overlap); rather, it solves the problem for sets of completely random numbers. As a result, it gives an overestimate for the probability of two strings sharing kmers by up to ~25% depending on the string length and kmer length. The reason is that if two strings share one kmer, there is a minimum of a 25% chance that they also share the next kmer, and 6.25% chance that they share the kmer after that, rather than a flat 1/(4^K) chance if the kmers were non-overlapping. Therefore, if two strings don't share a specific kmer, there must be a similarly reduced chance that they don't share the next kmer. It's probably possible to update the code to be correct given this information.

0
Entering edit mode

Here's the simulation version:

package fun;

import java.util.HashSet;
import java.util.Random;

import dna.AminoAcid;

public class ProbShared2 {

public static void main(String args[]){
int k=Integer.parseInt(args[0]);
int len1=Integer.parseInt(args[1]);
int len2=Integer.parseInt(args[2]);
int rounds=Integer.parseInt(args[3]);

System.out.println("Probability:   "+simulate(k, len1, len2, rounds));
}

static double simulate(int k, int len1, int len2, int rounds){
int successes=0;
final HashSet<Long> set=new HashSet<Long>();
for(int i=0; i<rounds; i++){
successes+=simulateOnePair(k, len1, len2, set);
}
return successes/(double)rounds;
}

static int simulateOnePair(int k, int len1, int len2, HashSet<Long> set){
set.clear();

final int shift=2*k;
long kmer=0;
int len=0;

byte[] bases=randomSequence(len2);

for(int i=0; i<bases.length; i++){
byte b=bases[i];
long x=baseToNumber[b];
if(x<0){len=0;}else{len++;}
if(len>=k){
}
}

bases=randomSequence(len1);

for(int i=0; i<bases.length; i++){
byte b=bases[i];
long x=baseToNumber[b];
if(x<0){len=0;}else{len++;}
if(len>=k){
if(set.contains(kmer)){return 1;}
}
}
return 0;
}

static byte[] randomSequence(int len){
byte[] array=new byte[len];
for(int i=0; i<len; i++){
int number=randy.nextInt(4);
array[i]=numberToBase[number];
}
return array;
}

static final Random randy=new Random();
static final byte[] numberToBase=AminoAcid.numberToBase;
static final byte[] baseToNumber=AminoAcid.baseToNumber;

}


Note that the answers are diverging slightly, so my initial solution does not appear to be quite correct. This could be due to the fact that kmers are not randomly distributed; some clump more tightly than others.

C:\temp>java fun.ProbShared 3 10 17
Cardinality 1: 8
Cardinality 2: 13
Probability:   0.8521277771213935

C:\temp>java fun.ProbShared2 3 10 17 1000000
Probability:   0.850832

C:\temp>java fun.ProbShared 9 180 200
Cardinality 1: 172
Cardinality 2: 192
Probability:   0.11844142052318785

C:\temp>java fun.ProbShared2 9 180 200 1000000
Probability:   0.094511

1
Entering edit mode

Interestingly, it does indeed look like the difference is based on kmer clumping, meaning they are not independently random. I made a 3rd version that simulates random sets of independent elements, rather than generating kmers from synthetic DNA, and it matches the output of the first program:

package fun;

import java.util.HashSet;
import java.util.Random;

public class ProbShared3 {

public static void main(String args[]){
int k=Integer.parseInt(args[0]);
int len1=Integer.parseInt(args[1]);
int len2=Integer.parseInt(args[2]);
int rounds=Integer.parseInt(args[3]);

System.out.println("Probability:   "+simulate(k, len1, len2, rounds));
}

static double simulate(int k, int len1, int len2, int rounds){
int successes=0;
final HashSet<Long> set=new HashSet<Long>();
for(int i=0; i<rounds; i++){
successes+=simulateOnePair(k, len1, len2, set);
}
return successes/(double)rounds;
}

static int simulateOnePair(int k, int len1, int len2, HashSet<Long> set){
fillRandomSet(k, len2, set);
final long space=(long)Math.pow(4, k);
final int kmers=len1-k+1;
for(int i=0; i<kmers; i++){
long kmer=(randy.nextLong()&Long.MAX_VALUE)%space;
if(set.contains(kmer)){return 1;}
}
return 0;
}

static void fillRandomSet(int k, int len, HashSet<Long> set){
set.clear();
final long space=(long)Math.pow(4, k);
final int kmers=len-k+1;
for(int i=0; i<kmers; i++){
}
}

static final Random randy=new Random();

}


Output:

C:\temp>java fun.ProbShared 9 180 200
Cardinality 1: 172
Cardinality 2: 192
Probability:   0.11844142052318785

C:\temp>java fun.ProbShared2 9 180 200 1000000
Probability:   0.094364

C:\temp>java fun.ProbShared3 9 180 200 1000000
Probability:   0.118574


Of course, only ProbShared2 is actually correct, but ProbShared is far faster and deterministic. I'm not really sure how to integrate the fact that kmers are mutually dependent into the deterministic calculation. Essentially, every time an unshared kmer is seen, the subsequent kmer is slightly more likely than expected by random chance to also be unshared, because they overlap.

1
Entering edit mode

This is amazing, thank you for your input !

0
Entering edit mode

You're welcome! I don't normally put so much effort into random problems, but this is directly related to a lot of the programs that I write, so I found it quite interesting.

The question of whether two sequences share any kmer (with overlapping kmers) is relevant to BBMap and Seal run in normal mode, while the question of whether two sequences share random non-overlapping kmers is relevant to BBMap and Seal run in low-memory mode (in which case they use a random subset of the kmers) as well as Dedupe, Clumpify, and Sketch, which all use non-overlapping kmers.

1
Entering edit mode
2.0 years ago
Martijn ▴ 10

There is a recent related problem on stackexchange Probability of a similar sub-sequence of length X in two sequences of length Y and Z.

One way to estimate the probability is to compute the expectation value for the number of matches of a sub-sequences of length 'k' and consider the distribution Poisson distributed. From this, you can compute the probability of zero occurrences. This estimate needs to be corrected slightly because there is some correlation, if one subsequence occurs, then it is more likely for a second one to occur. The final result is

P(n≥1) = 1−exp(−E[n]/(1+E[S]))

with E[n] = (y−x+1)(z−x+1)(1/k)^x and E[S] = 1/(k−1)

where y and z are the lengths of the sequences, x the length of the subsequence, and k the size of the alfabet (4 in the case of nucleotides).

With y = 180, z = 200, x = 9, k = 4 (from the other answer) we get the estimate p = 0.09015627

The R-code below computes two strings a million times and finds roughly the same answer p = 0.090055

sim = function(n1=100,n2=100, l = 9){
### create two sequences
s1 = sample(c("a","c","g","t"), n1, replace=TRUE)
s2 = sample(c("a","c","g","t"), n2, replace=TRUE)
### perform a loop that searches with grepl for every possible substring from s1 in the string made from s2
count = 0
string2 = paste0(s2, collapse = "")
for (i in 1:(n1-l+1)) {
substring = paste0(s1[c(1:l)+i-1], collapse = "")
if (grepl(substring, string2)) {
count = 1
}
}
return (count)
}

x = 9
y = 180
z = 200
k = 4

### this returns 0.09015627
1-exp(-(z-x+1)*(y-x+1)*(1/k)^x*(k-1)/k)

### this returns 0.090055
set.seed(1)
nt = 10^6
sum(replicate(nt,sim(y,z,x)))/nt