I'm reading Jones & Pevzner's Introduction to Bioinformatics Algorithms and am stuck at the end of chapter 6.14: spliced alignment problem. (research paper here)(slides on the chapter here including pseudocode at slide 28)
I understand that if I have lots of local alignments between a sample and reference DNA that if I just took the largest weighted non-overlapping path then I would get a nonsensical protein (because the protein is cut up in different orders on both strands). Example:
Sample strand (x = intron): 789xxxx123xxxx146xxxx456xxx
Reference protein: 123456789
This would align 789123146456
if we just took the weight and overlap into account!
So the book starts talking about this strange dynamic programming directed weighted graph (DAG) to solve the problem, which looks different from any DAG we've seen in the book so far, and for the life of me I can't wrap my head around the algorithm or pseudocode.
My question: Why can't I just simply align each local alignment to the reference protein, and THEN take the largest weighted non-overlapping path? Like this:
[1 4 6] ---local alignments
[123][456][789] ---local alignments
123 456 789 ---reference protein
= [123] [456] [789]
And if this is wrong, could you please explain the correct DAG algorithm in a way I could understand (I understand the book up to here, and DAGs and dynamic programming)? Thanks!
Edit: It appears that the book's algorithm is scanning every single possible chain at every point in an N^2 array... this can't be right!
Edit2: The algorithm (see slide 28 which I linked) generates a 3d graph! I think I almost understand this! Will post back when I do.
Interesting. I'm now looking into the GeneWise algorithm, and I'd be interested to know what algorithm is the current best gene predition one.
" It is essentially doing alignment chaining and Smith-Waterman alignment at the same time. "
My mind is now blown. It does Waterman and chaining at the same time, yet is suboptimal? How can it do both at the same time? I would be very grateful if you could summarize what's happening in the book's algorithm through explaining one iteration of the process. Also I appriciate all the time you've already spent on this!
On the spliced alignment, I realize later that it should be possible to jump between blocks. Then it could be faster than GeneWise. But I still think chaining will be much faster and give similar albeit slightly lower accuracy. As to why spliced alignment is suboptimal, exon blocks are usually determined by heuristic. They can be wrong. Meanwhile, you don't really need these blocks. Doing a dynamic programming in one go like GeneWise is likely to give more accurate result. As to which is the state-of-art protein-to-genome aligner, I think it is Exonerate. I am not so sure, though.
My understanding of the spliced alignment comes from the text description. It is possible and sensible to do SW inside blocks. SW is a dynamic programming; chaining is, too. There should naturally be a dynamic programming for the integrated problem. You only need to augment the state space. However, I haven't read the equations. That is why I said I could be wrong.