Question

Multiple alignments with same score in Smith-Waterman

0

Entering edit mode

18 months ago

amansinha.sw • 0

I am working on reducing the runtime of Smith-Waterman algorithm to enhance performance of GATK Haplotypecaller. While working on a dataset, I observed the possibility that multiple alignments can possibly have the same maximum score possible between two sequences being aligned. Since Bioinformatics research is not my core domain, I wanted to ask the community whether there is any implication of selecting any one of those alignments with maximum score on the downstream analysis system. My algorithm can perform well if there is no issue with accepting any one of those alignments with maximum score. Please suggest.

Haplotypecaller Smith-Waterman GATK • 712 views

ADD COMMENT • link updated 18 months ago by Istvan Albert 100k • written 18 months ago by amansinha.sw • 0

1

Entering edit mode

Not sure what you're getting at by reducing the runtime of SW. By now, alignment algorithms have been around for a few decades and optimization efforts have been ongoing so I would suggest to review what's been done in the domain over the last ~30 years before re-inventing the wheel. SW has already been optimized by Altschul and Erikcson from (O(m^2n) to O(mn). Then multiple CUDA implementations are available like here and here.

If your implementation preserves the optimal alignment guarantee of the original algorithm then all maximum scoring sequences represent optimal alignments but picking one over the other is not neutral for downstream analysis.

ADD REPLY • link 18 months ago by Jean-Karim Heriche 27k

score 2 · Answer 1 · 2022-11-01

Depending on the implementation of the algorithm that you use, sometimes identical alignments may be reported as separate alignments. For example

GA-T
GAAT

and

G-AT
GAAT

may or may not be reported as different alignments. In those cases, the alignments represent the same arrangement yet would be reported as different. I would first ensure that does not happen. Read up on variant normalization to see how people deal with it:

Unified representation of genetic variants Bioinformatics. 2015

But in many other cases, maximally scoring alignments are non-identical. Usually, this can be very pronounced when the sequences are dissimilar and when the scoring is compressed (the values of rewards and penalties are similar), you can end up with hundreds or thousands of equally scoring alignments.

In those cases, the alignments are technically all optimal yet quite different.