Alignment Algorithm
1
5
Entering edit mode
10.7 years ago

Hi all, I'm playing with an alignment algorithm, that should look like BLAT (I think).

From the Reference sequence, I've sorted all the words of size=K

AAAAAAACT  index in ref=1901
AAAAAAACT  2504
AAAAAACTT  2505
(...)
TTTTTTACC  189
TTTTTTTTA  87


when we want to map a sequence on the chromosome, the query is also indexed

 AAAAAAACG  20
AAAAAACGA  21
(...)
TTAATGGAG  4


so, the words of the query can be quickly mapped to the reference using a binary-search. At the end, I got an array of hits:

struct Hit
{
int position_in_ref;
int position_in_query;
int word=K;
}


that should look like this (the figure is from the BLAT paper)

Now, how should I use this information to get the best alignment(s). I suspect a simple solution should be something like a dynamic algorithm but I'm lost at this point. Any suggestion ?

algorithm alignment pairwise • 3.4k views
1
Entering edit mode

by "alignments", do you mean gapped or ungapped?

0
Entering edit mode

gapped alignments

6
Entering edit mode
10.7 years ago

So you've done your search stage and obtained candidate alignment seeds. Now you need to choose groups to join and fill the gaps. How to select groups of seeds? BLAT merges overlapping seeds to start with. Then you could look for groups that form chains within a certain diagonal band. You can find the band by subtracting the query base coordinate from the reference base coordinate. Then sort everything by band and the chains will be grouped together. I think BLAT recurses at some point, to further join these seeds by repeating the Kmer finding operation locally, with smaller and smaller Kmers.

Or, if you don't want to recreate BLAT exactly, you can use dynamic programming (e.g. Smith Waterman) to stitch them together. If your seeds are good enough, you could assume they are true and use DP to join them up, or you could go for it and DP re-align the whole area to get the best possible result, using banded alignment. A banded alignment means that you can ignore the DP cells outside your diagonal band and speed up the calculation. When you say "best" alignment, that implies to me a rigorous search i.e. SW, but maybe I've misunderstood that point.