Question: Trying To Find Appropriate Algorithm, Similar To String Alignment
6
7.5 years ago by
Jesse J150
Jesse J150 wrote:

I have three lists of sequences that I need to compare to each other, finding where they completely match. For example, if I have `AGCTATGCTCTATCGTACGT`, `TTAGCTATGCTCTATCGT`, and `AGCTATGCTCTATCGTACGTTTCA` then they would match as

``````    AGCTATGCTCTATCGTACGT
TTAGCTATGCTCTATCGT
AGCTATGCTCTATCGTACGTTTCA
``````

They match in whole segments, no mismatches or gaps, but they can have extra letters at the start or end of the matching string. I could go ahead and write my own algorithm to do this, but I'm sure there's already an algorithm out there that handles this. I'm aware of algorithms such as this, but that takes gaps and mismatches into account, which sounds like more computation than is needed.

Anyone know of such an algorithm, or something that can point me in the right direction?

UPDATE: At first I thought Philippe had the answer I was looking for (especially with a link to working Python code), but then I realized the algorithm isn't quite what I need. For example, the longest-common substring of `abba` and `cbbc` returns `bb`, where it should return null because `c` doesn't match with `a`.

RESOLVED: I decided to go and make my own algorithm. It's not the longest common substring, but it pointed me in the right direction, and runs in roughly the same amount of time.

sequence • 2.1k views
modified 7.4 years ago by Jeremy Leipzig18k • written 7.5 years ago by Jesse J150
7
7.5 years ago by
Andrew Su4.8k
San Diego, CA
Andrew Su4.8k wrote:

Unless computational performance is critical for your application (it doesn't sound like it is), then why not just use any old multiple sequence alignment program? Set your gap penalties to be high so the algorithm doesn't chase down dead ends and that will probably minimize the added overhead. In any case, I would guess that would be faster than trying to track down an ungapped MSA program. On the scale of things, you have a very simple problem so I'd guess the computation wouldn't take long anyway.

3
7.5 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

This sounds like a problem for Uclust.

2
7.5 years ago by
Philippe1.9k
Barcelona, Spain.
Philippe1.9k wrote:

What you try to do is to define the Longest Common Substring, there are several implementations of such algorithms available but there are generally restricted to pairwise comparisons.

The problem seems to be generalized to more than two sequences and is then referred as the k-common substring problem. You can find several articles or resources googling this term but that was starting to get too technical for me. =)

2
7.5 years ago by
Jeremy Leipzig18k wrote:

yes this is clustering

http://www.vmatch.de/