Question

Need Help Understanding Spliced Sequence Alignment Algorithm

5

Entering edit mode

12.1 years ago

corn8bit ▴ 140

I'm reading Jones & Pevzner's Introduction to Bioinformatics Algorithms and am stuck at the end of chapter 6.14: spliced alignment problem. (research paper here)(slides on the chapter here including pseudocode at slide 28)

I understand that if I have lots of local alignments between a sample and reference DNA that if I just took the largest weighted non-overlapping path then I would get a nonsensical protein (because the protein is cut up in different orders on both strands). Example:

Sample strand (x = intron): 789xxxx123xxxx146xxxx456xxx

Reference protein: 123456789

This would align 789123146456 if we just took the weight and overlap into account!

So the book starts talking about this strange dynamic programming directed weighted graph (DAG) to solve the problem, which looks different from any DAG we've seen in the book so far, and for the life of me I can't wrap my head around the algorithm or pseudocode.

My question: Why can't I just simply align each local alignment to the reference protein, and THEN take the largest weighted non-overlapping path? Like this:

[1    4 6]  ---local alignments
[123][456][789]  ---local alignments
 123  456  789  ---reference protein
= [123] [456] [789]

And if this is wrong, could you please explain the correct DAG algorithm in a way I could understand (I understand the book up to here, and DAGs and dynamic programming)? Thanks!

Edit: It appears that the book's algorithm is scanning every single possible chain at every point in an N^2 array... this can't be right!

Edit2: The algorithm (see slide 28 which I linked) generates a 3d graph! I think I almost understand this! Will post back when I do.

alignment homework algorithm • 4.9k views

ADD COMMENT • link 12.1 years ago by corn8bit ▴ 140

score 4 · Answer 1 · 2013-05-21

Alignment chaining problem. This is to find a subset of colinear local alignments that optimizes an objective function (e.g. sum of local alignment scores). The problem can be reduced to finding the optimal path on a DAG and can be solved by dynamic programming in quadratic time. For some types of objective functions, it has O(nlogn) solutions. There are quite a few papers on optimal chaining. It has been widely used in genome-wide aligners and in cDNA/EST aligners. The longest common subsequence (LCS) problem is a special case of alignment chaining. I think your solution is essentially alignment chaining.
Exon chaining problem (in the book). I think using "chaining" here is unusual. To me, chaining implies colinearity, but exon chaining here does not. The use of word "chaining" confused me for some time. Anyway, I have not seen this problem used in practice.
Spliced alignment problem (in the book). The following is my understanding, which could be wrong. Unlike alignment chaining, spliced alignment ignores the local alignment of each block and reconstructs the alignment while chaining. It is essentially doing alignment chaining and Smith-Waterman alignment at the same time.
The genewise model. GeneWise is different from the above in that it does not require a list of exon blocks. It simply aligns a protein to a genomic region in a splicing-aware way. It is using dynamic programming. GeneWise used to be the best program to predict genes.

The spliced alignment problem in the book is theoretically interesting, but I am not sure if it is practically advantageous. If it still scans every base in the reference, it will have similar or worse time complexity to GeneWise but may be less accurate; if it jumps between exon blocks, alignment chaining may achieve similar accuracy and be faster. It would be better to introduce alignment chaining and the GeneWise model in my opinion.

score 2 · Answer 2 · 2013-05-22

I figured it out! I recommend reading pages 1 and 2 of the research paper before continuing.

Essentially we are in fact doing what I described in the question. We're aligning the blocks to the protein, and taking the best non overlapping path. The clever part is that the algorithm does it all in one function!

The input to the algorithm is the following 3 variables:

Mystery_dna = The big test DNA strand with lots of exons and introns which we'll get our local alignment blocks from.

Known_protein = The DNA sequence which we're comparing our blocks to.

Blocks = Every single block from the Mystery_dna which matches a section in Known_protein. These are the local alignments.

The output is going to be the score for the best aligned path of blocks. (if we wanted the path we could easily enough store path information as we went, but for simplicity we should not do that now).

Let's set up our variables now.

Mystery_dna[i] = the value of the i point in the DNA strand
  (example: ATCGATCG would be G at i = 4)
Known_protein[j] = the value of the j point in the protein strand
Blocks[k] = one block of nucleotides 
Up_to_block[k] = all the blocks up to and excluding the ones that contain Mystery_dna[k], this is to weed out overlaps.
  (example: Mystery_dna = ATCG
  Blocks = [AT][GC]
  Up_to_block[3] = The blocks not including Mystery_dna[3] = [AT])
Last (Block[k]) = The last position in Block k, which will match that position on the Mystery_dna.
First (k) = The first position in Block k.
#this is a comment, which I'll start with a #

Now that we have our variables, we should go over what happens in the recursion. What will happen is we'll generate a 3d table which is just the same 2d table dynamic programming thing that is explained all through chapter 6 of the book with an added dimension of Blocks. At each iteration we'll go through each current point at each block. If we're at the start of any block then we need to use the Up_to_block function and to weed out overlaps and use the value of the greatest Block's last point (up to Block[current]). We'll use +1 for a match and -1 for an indel or mismatch for simplicity.

The recursion: (remember, i, j ,k are indices for our inputs)

Score (i, j, k):
  Return Max of 
    # These 2 lines are normal dynamic programming, just finding the best path along the protein.
    if i != First (k): Score (i-1, j-1, k) + 1 
    if i != First (k): Score (i-1, j, k) - 1
    # THIS is the hard part! If we're at the start of our block-
    # (and remember we'll go through each point for each block)
    # -then continue the value from the greatest, non-overlapping block up to this point. It won't 
    # overlap because of "Up_to_block[k]"
    if i == First (k): Return Max of the following for all cur_block in Up_to_block[k] 
      Score (Last (cur_block), j-1, cur_block) + 1 
    # This line just happens if we have a mismatch at our first position in the block.
    if i == First (k): Return Max of the following for all cur_block in Up_to_block[k] 
      Score (Last (cur_block), j, cur_block) - 1 
    Score (i, j-1, k) -1

Now, if you made it this far you have a 3d array of values, and the one in one of the Blocks at the end of the array will be our score.

Return the Max of this for each block in Blocks:
  Array [Last[block], end_pos_of_protein, block]

And that will be the max score!

Note: lh3's answer where he mentioned doing Waterman and chaining at the same time was key to me figuring this out, so I'm marking that as the answer.