Efficient way of calculating the number of mismatches between two sets of sequences please?
3
2
Entering edit mode
9.2 years ago
Aurelie MLB ▴ 360

Hello,

I am working with R and Bioconductor packages. I have set big sets of sequences such as:

A=c("TATTCCTAGGTTCGCCT", "TTCCCGTTGCCCAGTGA"....) # length ~ 500
B=c("CTCCACACCAAAGCATC", "AACTGTGAGATTAATCT") # length ~10 000 000


What I would like to know ultimately is which are the sequences from B that match each sequence of A with a most 5 mismatches. e.g.: something like

res$A1 5, 5000,8 000 000... res$A2
3005, 7560,5 003 542...


I could do loops or some "apply"... but it is taking ages...

I looked on the PDict, matchPDict, vwhichPDict side as well. It is much more efficient. But My sequences are too short: PDict would not let me set the max.mismatch parameter to 5.

As the sequences from A and B are exactly of the same length, I do not need searches or alignments. I probably just need to calculate the number of mismatches directly. But I cannot find a way of doing it really efficiently and quickly.

Many thanks

alignment genome sequence • 7.7k views
2
Entering edit mode

Thanks a lot for your suggestions! I wanted to double check with experts that I was not missing anything obvious already implemented in R. I finally tried to write a C function that does what I need and called it from R. I takes around 400 seconds to run. Amazing compared to R or python!

1
Entering edit mode

Well done! For the record, a couple of comments.

1) The python implementation is not as bad as I thought. For 500 * 10,000,000 comparisons it should take around ~1/2 hour or less to calculate the Levenshtein distance (~1500s, see test below):

import Levenshtein as lv
import time
s1= "TATTCCTAGGTTCGCCT"
s2= "TATTCCTAGGTTCGCCN"
t0= time.time()
i= 0
while i < 10000000:
ld= lv.distance(s1, s2)
i+=1

t1= time.time()
print((t1-t0)*500) ## >>> ~1516.85

2) You probably realized this already: Since you say that only substitutions are allowed than the Levenshtein distance is unnecessarily expensive to compute. The Hamming distance seems what you need and it should be much faster, I'd guess a 10x faster?

0
Entering edit mode

Possibly more than 10x faster since it's more of a bounded Hamming distance (i.e., the function can simply return once the distance >5).

0
Entering edit mode

400 seconds still seems a bit on the slow end, though I guess it's fast enough.

0
Entering edit mode

Are you allowing insertions/deletions or only substitutions?

0
Entering edit mode

Hello! Only substitutions.

2
Entering edit mode
9.2 years ago

I don't know how slow R is going to be. If you need to do this only once in a while, this simple, almost brute force approach could do (it's python):

import Levenshtein as lv
import time
set1= ["TATTCCTAGGTTCGCCT", "TTCCCGTTGCCCAGTGA"]
set2= ["CTCCACACCAAAGCATC", "AACTGTGAGATTAATCT", "TATTCCTAGGTTCGCCT", "TTCCCGTTGCCCAGTGA", "TTCCCGTTGCCCAGTGC", "TATTCCTAGGTTCGCCA"]

for s1 in set1:
for s2 in set2:
d12= lv.distance(s1, s2)
if d12 < 5:
print(s1, s2, d12)



For each sequence in set1 it computes the Levenshtein distance with each sequence in set2 and it prints out the pair if the distance is less than 5. For 500 * 10000000 it should take maybe few hours.

0
Entering edit mode

Hi, thanks a lot for your idea! Unfortunately it does not do the trick for me as I need to run quite often as part as another program...

0
Entering edit mode

Then why don't you convert the above code in to a function and use it ?

0
Entering edit mode

Thanks a lot! My point was just that I would need a quicker solution as I need to do this hundreds of times...

1
Entering edit mode
9.2 years ago

Perhaps you could still use the efficient string matching function vmatchPattern in Biocondoctor's Biostrings package, like so:

A <- DNAStringSet(A)
B <- DNAStringSet(B)
res <- lapply(A, vmatchPattern, B, max.mismatch=5)

Does that do the trick for you?

0
Entering edit mode

Hi! Thanks a lot for your answer. I tried this and unfortunately it is still quite slow because of the lapply...I think I would need an algorithm / a function from a package that does all the loop internally (maybe coded in C ?) and that is ultra optimised...

0
Entering edit mode

The lapply can't be that slow, though, right? It's only looping through ~ 500 strings.

I'm not doubting that the entire endeavor is slow, but given the small size of A, I guess you're saying it's the vmatchPattern that's actually slow?

0
Entering edit mode

If vmatchPattern was able to loop internally on A that would probably be a great solution. But unfortunately it does not...

1
Entering edit mode
12 weeks ago
j.gleixner ▴ 20

Since all your sequences have the same length (say M) and you are only looking concerned with hamming distance you can encode your sequences into a M*4 dimensional space and use knn algorithms to find the matches:

onehot_encode <- function(x) {
stopifnot(all(nchar(x[1])==nchar(x)))
t(matrix(1L*cbind(A=c(F,F,F,T), C= c(F,F,T,F), G=c(F,T,F,F), T=c(T,F,F,F))[,unlist(strsplit( x, split= ""))], nrow=4*nchar(x[1])))
}
A <- c("ATCGGTG", "TTGCCTG")
B <- c("ATCAGTG", "TTGCCTG","GTCGGTG", "TTACCTG")
res <- FNN::get.knnx(onehot_encode(A), onehot_encode(B), k=2)
with(res, data.frame(idx=as.vector(nn.index), hamming_dist = as.vector(nn.dist^2)/2, B, matching_A=A[as.vector(nn.index)]))

  idx hamming_dist       B matching_A
1   1            1 ATCAGTG    ATCGGTG
2   2            0 TTGCCTG    TTGCCTG
3   1            1 GTCGGTG    ATCGGTG
4   2            1 TTACCTG    TTGCCTG
5   2            4 ATCAGTG    TTGCCTG
6   1            4 TTGCCTG    ATCGGTG
7   2            4 GTCGGTG    TTGCCTG
8   1            4 TTACCTG    ATCGGTG


Although it encodes each 4 bit nucleotide with 4 bytes this should be rather efficient scale with something like O(length(B) log(length(A)) M *log(M)) or so:

A=rep(c("TATTCCTAGGTTCGCCT", "TTCCCGTTGCCCAGTGA"), 250) # length ~ 500
B=rep(c("CTCCACACCAAAGCATC", "AACTGTGAGATTAATCT"), 5000000) # length ~10 000 000
res <- FNN::get.knnx(onehot_encode(A), onehot_encode(B), k=2)


takes about a minute on our server.

FNN does not allow to find only all matches within a radius, but you can implement something around or use a different knn library in case you really need it (i.e. finding the k best matches is not sufficient)

However, in my experience you usually want to use Levenshtein distance instead.