Question: Trying To Find Appropriate Algorithm, Similar To String Alignment
6
gravatar for Jesse J
7.5 years ago by
Jesse J150
Jesse J150 wrote:

I have three lists of sequences that I need to compare to each other, finding where they completely match. For example, if I have AGCTATGCTCTATCGTACGT, TTAGCTATGCTCTATCGT, and AGCTATGCTCTATCGTACGTTTCA then they would match as

    AGCTATGCTCTATCGTACGT
  TTAGCTATGCTCTATCGT
    AGCTATGCTCTATCGTACGTTTCA

They match in whole segments, no mismatches or gaps, but they can have extra letters at the start or end of the matching string. I could go ahead and write my own algorithm to do this, but I'm sure there's already an algorithm out there that handles this. I'm aware of algorithms such as this, but that takes gaps and mismatches into account, which sounds like more computation than is needed.

Anyone know of such an algorithm, or something that can point me in the right direction?

UPDATE: At first I thought Philippe had the answer I was looking for (especially with a link to working Python code), but then I realized the algorithm isn't quite what I need. For example, the longest-common substring of abba and cbbc returns bb, where it should return null because c doesn't match with a.

RESOLVED: I decided to go and make my own algorithm. It's not the longest common substring, but it pointed me in the right direction, and runs in roughly the same amount of time.

sequence • 2.1k views
ADD COMMENTlink modified 7.4 years ago by Jeremy Leipzig18k • written 7.5 years ago by Jesse J150
7
gravatar for Andrew Su
7.5 years ago by
Andrew Su4.8k
San Diego, CA
Andrew Su4.8k wrote:

Unless computational performance is critical for your application (it doesn't sound like it is), then why not just use any old multiple sequence alignment program? Set your gap penalties to be high so the algorithm doesn't chase down dead ends and that will probably minimize the added overhead. In any case, I would guess that would be faster than trying to track down an ungapped MSA program. On the scale of things, you have a very simple problem so I'd guess the computation wouldn't take long anyway.

ADD COMMENTlink written 7.5 years ago by Andrew Su4.8k
3
gravatar for Martin A Hansen
7.5 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

This sounds like a problem for Uclust.

ADD COMMENTlink written 7.5 years ago by Martin A Hansen3.0k
2
gravatar for Philippe
7.5 years ago by
Philippe1.9k
Barcelona, Spain.
Philippe1.9k wrote:

What you try to do is to define the Longest Common Substring, there are several implementations of such algorithms available but there are generally restricted to pairwise comparisons.

The problem seems to be generalized to more than two sequences and is then referred as the k-common substring problem. You can find several articles or resources googling this term but that was starting to get too technical for me. =)

ADD COMMENTlink written 7.5 years ago by Philippe1.9k
2
gravatar for Jeremy Leipzig
7.5 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

yes this is clustering

http://www.vmatch.de/

ADD COMMENTlink written 7.5 years ago by Jeremy Leipzig18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1361 users visited in the last hour