Question: How To Generate The Set Of Sequences At A Given Edit Distance From A Reference?
7
9.4 years ago by
Jarretinha3.3k
São Paulo, Brazil
Jarretinha3.3k wrote:

Hi there,

Now a practical question. I need to generate the set of sequences at a given edit distance from a reference. Let's assume Levenshtein distance for the matter. From example, I have a sequence of a TFBS and want to generate all sequences with edit distance equal 2. Of course, efficiency is very important as the problem tends to become very very large rapidly with increasing distance/reference size.

I've already searched Knuth collection and the usual sources. But found only efficient solutions for comparison.

sequence • 4.0k views
modified 9.4 years ago by Bilouweb1.1k • written 9.4 years ago by Jarretinha3.3k

if you come up with a simple implementation, can you post a link to some code? this is an interesting problem.

I will certainly post the code.

4
9.4 years ago by
Bilouweb1.1k
Saclay, France
Bilouweb1.1k wrote:

If I understood your problem, I think it corresponds to enumerate all possible combinations of substitutions on all combinations of k positions in a sequence. (k is the edit distance)

This can be done by Knuth algorithm for combination enumeration. First you enumerate combinations of positions then you enumerate combinations of modifications.

I made a brute force solver for alignments based on this idea. But you can not avoid complexity.

For example, if you want to generate sequences at an edit distance 2 (k) from a reference sequence of length 5 (m) with an alphabet of 4 (a) letters:

First, you choose 2 positions from 5 without repeats (10 pairs)

Second, on these positions, you enumerate the combinations of modifications with repeats (6).

So you have 60 sequences.

I think the number of solutions is like that : `(m! / (k!*(m-k)!)) * ((a+k-2)!/(k!(a-2)!))`

I hope it will helps you ;)

Bilou.

Edit :

After discussion, I think the best way to generate your sequences is the following :

• Generate sequences with distance k with 0 indel
• Generate sequences with distance k-1 with 1 indel
• Generate sequences with distance k-2 with 2 indels
• ...
• Generate sequences with distance 0 with k indels

It is a huge number of sequences !!

But, what about indels? The number of solutions you gave are for fixed size. Levenshtein allows indels. This is the real problematic point.

I think it works the same if indel is a symbol of your alphabet.

I think it not. Now, in your example one has to deal with strings of size 3-7. So, there is something missing in your formula. And, for implementation purposes, this problem has a lot of symmetries (reversals, palindromes, etc.) that you can use.

I think I got it ! Consider a modification at a given position as : - a substitution with a letter of the alphabet - or a substitution with an indel - or the insertion of an indel

You have all possible modifications at a given position. Instead of counting letters, let (a) be the number of possible modifications, then the equation holds.

I found this

Oh yeah !!! This post seems to pave the way correctly !!! I think I can handle the symmetry part. Never saw Scala used in this context. I'll mark this as the correct answer. But, further additions are welcome, of course.

Happy to help you ;) But I think there is another problem with the fact that you can add k indels at a same position... I have to think about that.

Yeah. Now you really got the point. But the stackoverflow post gives a good hint on how to generate possible paths. I know that's a huge number of sequences. That's why I'm asking for tips on how to implement it. But, as you precisely noted, the combinatorics of the problem is interesting too. Now, imagine that after this part I will add some physical calculations on each sequence. Got the gist?

Yeah ! Sounds like a crazy number of calculations !

I have implemented the generator in C++. If you want the code, just ask.

1
9.4 years ago by
Cambridge, MA

Isn't the underlying problem here just going to be the number of potential sequences? That is, I'm not sure there is going to be a way to reduce the time complexity if you need to enumerate them all.

I suppose you could build up some sort of tree that does represent all sequences in a compressed space. Or if you need to do some computation on that set of sequences, there are probably tricks (like the suffix-array indexing in BWA etc.) where you could compute on big parts of your sequence space at once, without having to explicitly enumerate it...