Question: How To Generate The Set Of Sequences At A Given Edit Distance From A Reference?
7
gravatar for Jarretinha
9.4 years ago by
Jarretinha3.3k
São Paulo, Brazil
Jarretinha3.3k wrote:

Hi there,

Now a practical question. I need to generate the set of sequences at a given edit distance from a reference. Let's assume Levenshtein distance for the matter. From example, I have a sequence of a TFBS and want to generate all sequences with edit distance equal 2. Of course, efficiency is very important as the problem tends to become very very large rapidly with increasing distance/reference size.

I've already searched Knuth collection and the usual sources. But found only efficient solutions for comparison.

sequence • 4.0k views
ADD COMMENTlink modified 9.4 years ago by Bilouweb1.1k • written 9.4 years ago by Jarretinha3.3k

if you come up with a simple implementation, can you post a link to some code? this is an interesting problem.

ADD REPLYlink written 9.4 years ago by brentp23k

I will certainly post the code.

ADD REPLYlink written 9.4 years ago by Jarretinha3.3k
4
gravatar for Bilouweb
9.4 years ago by
Bilouweb1.1k
Saclay, France
Bilouweb1.1k wrote:

If I understood your problem, I think it corresponds to enumerate all possible combinations of substitutions on all combinations of k positions in a sequence. (k is the edit distance)

This can be done by Knuth algorithm for combination enumeration. First you enumerate combinations of positions then you enumerate combinations of modifications.

I made a brute force solver for alignments based on this idea. But you can not avoid complexity.

For example, if you want to generate sequences at an edit distance 2 (k) from a reference sequence of length 5 (m) with an alphabet of 4 (a) letters:

First, you choose 2 positions from 5 without repeats (10 pairs)

Second, on these positions, you enumerate the combinations of modifications with repeats (6).

So you have 60 sequences.

I think the number of solutions is like that : (m! / (k!*(m-k)!)) * ((a+k-2)!/(k!(a-2)!))

I hope it will helps you ;)

Bilou.

Edit :

After discussion, I think the best way to generate your sequences is the following :

  • Generate sequences with distance k with 0 indel
  • Generate sequences with distance k-1 with 1 indel
  • Generate sequences with distance k-2 with 2 indels
  • ...
  • Generate sequences with distance 0 with k indels

It is a huge number of sequences !!

ADD COMMENTlink modified 11 months ago by RamRS23k • written 9.4 years ago by Bilouweb1.1k

But, what about indels? The number of solutions you gave are for fixed size. Levenshtein allows indels. This is the real problematic point.

ADD REPLYlink written 9.4 years ago by Jarretinha3.3k

I think it works the same if indel is a symbol of your alphabet.

ADD REPLYlink written 9.4 years ago by Bilouweb1.1k

I think it not. Now, in your example one has to deal with strings of size 3-7. So, there is something missing in your formula. And, for implementation purposes, this problem has a lot of symmetries (reversals, palindromes, etc.) that you can use.

ADD REPLYlink written 9.4 years ago by Jarretinha3.3k

I think I got it ! Consider a modification at a given position as : - a substitution with a letter of the alphabet - or a substitution with an indel - or the insertion of an indel

You have all possible modifications at a given position. Instead of counting letters, let (a) be the number of possible modifications, then the equation holds.

ADD REPLYlink written 9.4 years ago by Bilouweb1.1k

I found this

ADD REPLYlink modified 13 hours ago by RamRS23k • written 9.4 years ago by Bilouweb1.1k

Oh yeah !!! This post seems to pave the way correctly !!! I think I can handle the symmetry part. Never saw Scala used in this context. I'll mark this as the correct answer. But, further additions are welcome, of course.

ADD REPLYlink written 9.4 years ago by Jarretinha3.3k

Happy to help you ;) But I think there is another problem with the fact that you can add k indels at a same position... I have to think about that.

ADD REPLYlink written 9.4 years ago by Bilouweb1.1k

Yeah. Now you really got the point. But the stackoverflow post gives a good hint on how to generate possible paths. I know that's a huge number of sequences. That's why I'm asking for tips on how to implement it. But, as you precisely noted, the combinatorics of the problem is interesting too. Now, imagine that after this part I will add some physical calculations on each sequence. Got the gist?

ADD REPLYlink written 9.4 years ago by Jarretinha3.3k

Yeah ! Sounds like a crazy number of calculations !

ADD REPLYlink written 9.4 years ago by Bilouweb1.1k

I have implemented the generator in C++. If you want the code, just ask.

ADD REPLYlink written 9.4 years ago by Bilouweb1.1k
1
gravatar for Adrian
9.4 years ago by
Adrian680
Cambridge, MA
Adrian680 wrote:

Isn't the underlying problem here just going to be the number of potential sequences? That is, I'm not sure there is going to be a way to reduce the time complexity if you need to enumerate them all.

I suppose you could build up some sort of tree that does represent all sequences in a compressed space. Or if you need to do some computation on that set of sequences, there are probably tricks (like the suffix-array indexing in BWA etc.) where you could compute on big parts of your sequence space at once, without having to explicitly enumerate it...

ADD COMMENTlink written 9.4 years ago by Adrian680

The idea is to enumerate. Suffix arrays deal only with substrings. But, edit distance relates sequences that can be longer or shorter than the reference.

ADD REPLYlink written 9.4 years ago by Jarretinha3.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 924 users visited in the last hour