Question

How To Generate The Set Of Sequences At A Given Edit Distance From A Reference?

7

Entering edit mode

14.6 years ago

Jarretinha 3.4k

Hi there,

Now a practical question. I need to generate the set of sequences at a given edit distance from a reference. Let's assume Levenshtein distance for the matter. From example, I have a sequence of a TFBS and want to generate all sequences with edit distance equal 2. Of course, efficiency is very important as the problem tends to become very very large rapidly with increasing distance/reference size.

I've already searched Knuth collection and the usual sources. But found only efficient solutions for comparison.

sequence • 7.9k views

ADD COMMENT • link updated 2.1 years ago by Craig • 0 • written 14.6 years ago by Jarretinha 3.4k

0

Entering edit mode

if you come up with a simple implementation, can you post a link to some code? this is an interesting problem.

ADD REPLY • link 14.6 years ago by brentp 24k

0

Entering edit mode

I will certainly post the code.

ADD REPLY • link 14.6 years ago by Jarretinha 3.4k

Ram · Answer 1 · 2010-04-02

4

Entering edit mode

14.6 years ago

Bilouweb ★ 1.1k

If I understood your problem, I think it corresponds to enumerate all possible combinations of substitutions on all combinations of k positions in a sequence. (k is the edit distance)

This can be done by Knuth algorithm for combination enumeration. First you enumerate combinations of positions then you enumerate combinations of modifications.

I made a brute force solver for alignments based on this idea. But you can not avoid complexity.

For example, if you want to generate sequences at an edit distance 2 (k) from a reference sequence of length 5 (m) with an alphabet of 4 (a) letters:

First, you choose 2 positions from 5 without repeats (10 pairs)

Second, on these positions, you enumerate the combinations of modifications with repeats (6).

So you have 60 sequences.

I think the number of solutions is like that : (m! / (k!*(m-k)!)) * ((a+k-2)!/(k!(a-2)!))

I hope it will helps you ;)

Bilou.

Edit :

After discussion, I think the best way to generate your sequences is the following :

Generate sequences with distance k with 0 indel
Generate sequences with distance k-1 with 1 indel
Generate sequences with distance k-2 with 2 indels
...
Generate sequences with distance 0 with k indels

It is a huge number of sequences !!

ADD COMMENT • link updated 6.2 years ago by Ram 44k • written 14.6 years ago by Bilouweb ★ 1.1k

0

Entering edit mode

But, what about indels? The number of solutions you gave are for fixed size. Levenshtein allows indels. This is the real problematic point.

ADD REPLY • link 14.6 years ago by Jarretinha 3.4k

0

Entering edit mode

I think it works the same if indel is a symbol of your alphabet.

ADD REPLY • link 14.6 years ago by Bilouweb ★ 1.1k

0

Entering edit mode

I think it not. Now, in your example one has to deal with strings of size 3-7. So, there is something missing in your formula. And, for implementation purposes, this problem has a lot of symmetries (reversals, palindromes, etc.) that you can use.

ADD REPLY • link 14.6 years ago by Jarretinha 3.4k

0

Entering edit mode

I think I got it ! Consider a modification at a given position as : - a substitution with a letter of the alphabet - or a substitution with an indel - or the insertion of an indel

You have all possible modifications at a given position. Instead of counting letters, let (a) be the number of possible modifications, then the equation holds.

ADD REPLY • link 14.6 years ago by Bilouweb ★ 1.1k

0

Entering edit mode

I found this

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 14.6 years ago by Bilouweb ★ 1.1k

0

Entering edit mode

Oh yeah !!! This post seems to pave the way correctly !!! I think I can handle the symmetry part. Never saw Scala used in this context. I'll mark this as the correct answer. But, further additions are welcome, of course.

ADD REPLY • link 14.6 years ago by Jarretinha 3.4k

0

Entering edit mode

Happy to help you ;) But I think there is another problem with the fact that you can add k indels at a same position... I have to think about that.

ADD REPLY • link 14.6 years ago by Bilouweb ★ 1.1k

0

Entering edit mode

Yeah. Now you really got the point. But the stackoverflow post gives a good hint on how to generate possible paths. I know that's a huge number of sequences. That's why I'm asking for tips on how to implement it. But, as you precisely noted, the combinatorics of the problem is interesting too. Now, imagine that after this part I will add some physical calculations on each sequence. Got the gist?

ADD REPLY • link 14.6 years ago by Jarretinha 3.4k

0

Entering edit mode

Yeah ! Sounds like a crazy number of calculations !

ADD REPLY • link 14.6 years ago by Bilouweb ★ 1.1k

0

Entering edit mode

I have implemented the generator in C++. If you want the code, just ask.

ADD REPLY • link 14.6 years ago by Bilouweb ★ 1.1k

0

Entering edit mode

I would like the code if you still have it

ADD REPLY • link 2.1 years ago by Craig • 0

score 1 · Answer 2 · 2010-04-02

1

Entering edit mode

14.6 years ago

Adrian ▴ 700

Isn't the underlying problem here just going to be the number of potential sequences? That is, I'm not sure there is going to be a way to reduce the time complexity if you need to enumerate them all.

I suppose you could build up some sort of tree that does represent all sequences in a compressed space. Or if you need to do some computation on that set of sequences, there are probably tricks (like the suffix-array indexing in BWA etc.) where you could compute on big parts of your sequence space at once, without having to explicitly enumerate it...

ADD COMMENT • link 14.6 years ago by Adrian ▴ 700

0

Entering edit mode

The idea is to enumerate. Suffix arrays deal only with substrings. But, edit distance relates sequences that can be longer or shorter than the reference.

ADD REPLY • link 14.6 years ago by Jarretinha 3.4k