Question

How to Detect Frameshifts in a Protein Alignment

0

4.2 years ago

nickeener ▴ 60

I'm performing multiple alignments on a series of protein sequences, but I want to throw out any sequence that has a frameshift mutation compared to a reference sequence. I'm starting with in-frame genes, which are then converted into their amino acid sequences. My original idea was to simply run a local pairwise alignment between each sequence and the reference and look for a point in the alignment after which the two sequences become significantly different, but I'm not sure how to reliably determine where the two sequences diverge.

Any ideas would be greatly appreciated. Thanks!

Frameshift Mutation Detection Sequence Alignment • 2.4k views

ADD COMMENT • link updated 4.2 years ago by Mensur Dlakic ★ 27k • written 4.2 years ago by nickeener ▴ 60

0

How about using BLAST to align the gene sequences to the reference, or using the pairwise alignment that you already have? Any insertion or deletion whose length is not a multiple of three starts a frameshift, but a deletion followed by an insertion of the same length (or vice versa) restores the frame. Three single-base deletions or three single-base insertions will also restore it.
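A minimal sketch of that bookkeeping, assuming you already have a gapped nucleotide pairwise alignment as two equal-length strings with `-` for gaps. The function name `frameshift_regions` is made up for illustration; it tracks the net indel length modulo 3 and reports the alignment columns where the query is out of frame.

```python
from itertools import groupby


def frameshift_regions(ref_aln, qry_aln):
    """Return (start, end) column ranges where the query is out of frame.

    ref_aln and qry_aln are gapped nucleotide strings of equal length.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")

    def column_type(pair):
        r, q = pair
        if r == "-" and q != "-":
            return "ins"    # base present only in the query
        if q == "-" and r != "-":
            return "del"    # base missing from the query
        return "match"

    regions = []
    offset = 0      # net inserted-minus-deleted bases, modulo 3
    start = None    # column where the current out-of-frame stretch began
    col = 0
    for kind, run in groupby(zip(ref_aln, qry_aln), key=column_type):
        length = sum(1 for _ in run)
        end = col + length
        if kind != "match":
            offset = (offset + (length if kind == "ins" else -length)) % 3
            if offset != 0 and start is None:
                start = end                     # shifted from the next column on
            elif offset == 0 and start is not None:
                regions.append((start, col))    # a later indel restored the frame
                start = None
        col = end
    if start is not None and start < len(ref_aln):
        regions.append((start, len(ref_aln)))
    return regions
```

An empty list means every indel is in frame or compensated; a region that runs to the end of the alignment is an uncompensated frameshift.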

ADD REPLY • link 4.2 years ago by Fatima ▴ 1000

score 2 · Answer 1 · 2020-02-18

2

4.2 years ago

Brice Sarver ★ 3.8k

What you're looking for is generally called a translation alignment. You'll start with a set of nucleotide sequences. These are converted to amino acids based on a translation table, aligned, and back-translated to their original nucleotides. This will make sure that the protein-coding genes are kept in-frame.
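As a rough illustration of the back-translation step only (the protein alignment itself would come from whatever aligner you use), here is a sketch that threads the original codons back onto a gapped protein sequence. The function name `back_translate` is hypothetical, and it assumes the nucleotide sequence starts in frame.

```python
def back_translate(aligned_protein, nucleotides):
    """Thread the original codons onto a gapped protein sequence.

    Each residue consumes one codon from `nucleotides`; each '-' in the
    protein alignment becomes a three-base gap '---'.
    """
    codons = [nucleotides[i:i + 3] for i in range(0, len(nucleotides) - 2, 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if residues > len(codons):
        raise ValueError("protein has more residues than the DNA has codons")
    codon_iter = iter(codons)
    return "".join("---" if aa == "-" else next(codon_iter)
                   for aa in aligned_protein)
```

Because gaps are always inserted three bases at a time, the resulting nucleotide alignment stays in frame by construction.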

Check out TranslatorX. T-Coffee may be helpful for you in this case, especially to compare among different algorithms; their visualizations show a conservation score that may naturally identify breakpoints. You can also have it guess the most likely reading frame (probably by looking at the 3 forward and 3 reverse translations). There are other tools available that you can track down with some searching. You'll be able to get this relative to your reference by including the reference in the MSA.

You can also do this in a pairwise fashion, as you suggested.

EDIT: I assumed that you wanted to avoid frameshifts within homologous/orthologous sequences, but you could be looking for these in a disease context. Perhaps the key here is simply visualization; maybe Jalview could be of use?

ADD COMMENT • link 4.2 years ago by Brice Sarver ★ 3.8k

1

Thanks for the answer! What I'm really trying to do is write my own translation aligner similar to the TranslatorX tool that you mentioned but with some additional features. So if possible I would like to do this without adding another tool dependency. So what I need is a method to detect a frameshift mutation from a pairwise alignment. I've tried calculating a rolling window of edit distances between the alignments and then if the edit distance suddenly increases significantly then I know there is a frameshift. But I'm having difficulty setting a normalized threshold for a significant increase in edit distance that works on most alignments.

ADD REPLY • link 4.2 years ago by nickeener ▴ 60

1

What you tend to see if something is out of frame is an overrepresentation of stop codons in the downstream sequence - not always, but it acts as a good flag (especially if you're only expecting one and it's at the end of the string). You can explore this by looking at the first, second, and third translation frames and seeing if it jibes with the datasets you're working with. So, perhaps instead of a simple Hamming or Levenshtein distance between the sequences, you may be able to get some resolution by combining a distance-based metric with a sliding window for stop codon presence/absence. Just spitballing here, but maybe it will help :)
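A small sketch of the stop-codon check, assuming the standard genetic code (NCBI translation table 1) and plain-Python helpers whose names (`translate`, `stop_counts_by_frame`, `premature_stops`, `stop_density`) are made up here. It counts stops in the three forward frames and slides a window along a translated sequence.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate from the given forward frame (0, 1 or 2); unknown codons give 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(frame, len(dna) - 2, 3))


def stop_counts_by_frame(dna):
    """Number of stop codons seen in each of the three forward frames."""
    return [translate(dna, frame).count("*") for frame in range(3)]


def premature_stops(protein):
    """Positions of stop codons that are not the final residue."""
    return [i for i, aa in enumerate(protein[:-1]) if aa == "*"]


def stop_density(protein, window=10):
    """Stop-codon count in each sliding window along the protein."""
    n_windows = max(1, len(protein) - window + 1)
    return [protein[i:i + window].count("*") for i in range(n_windows)]
```

A run of windows with nonzero counts downstream of some point, combined with a jump in your distance metric at the same place, would be a stronger signal than either alone. The window size is a free parameter you would have to tune.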

ADD REPLY • link 4.2 years ago by Brice Sarver ★ 3.8k

score 1 · Answer 2 · 2020-02-18

I've tried calculating a rolling window of edit distances between the alignments and then if the edit distance suddenly increases significantly then I know there is a frameshift. But I'm having difficulty setting a normalized threshold for a significant increase in edit distance that works on most alignments.

I think you are on the right track here, but I would look at the alignments in a general sense rather than at a local increase or decrease in distance.

Here is how I would do it in a non-optimized fashion: 1) take two DNA sequences, translate, align and score the alignment; 2) remove one base at a time in a sliding-window fashion from your non-reference sequence, translate, align and score; 3) do the same as in (2) but remove two bases at a time. If you end up with a better alignment (higher score) in either (2) or (3) compared to (1), that would mean that by removing a base or two you brought the frameshifted sequence back into frame. If (1) is the best score, that would mean there was no frameshift to begin with and doing (2) or (3) actually introduced one.

I am sure this is not a perfect strategy, especially if the frameshift is close to the C-terminus, but it could work with some extra thinking and optimization.
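The three steps can be sketched roughly as follows. This is only an illustration under stated assumptions: the scoring is a plain global alignment with identity scores and a linear gap penalty (a real implementation would more likely use a substitution matrix such as BLOSUM62 with affine gaps), the genetic code is the standard one, and the names `align_score` and `scan_for_frameshift` are invented for this example.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate frame 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def align_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * gap]
        for j, y in enumerate(b, 1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def scan_for_frameshift(ref_dna, qry_dna):
    """Return (best_score, bases_removed, position).

    bases_removed == 0 means the untouched query scored best, i.e. no
    evidence of a frameshift under this scoring scheme.
    """
    ref_prot = translate(ref_dna)
    best = (align_score(ref_prot, translate(qry_dna)), 0, None)   # step 1
    for n in (1, 2):                                              # steps 2 and 3
        for pos in range(len(qry_dna) - n + 1):
            trimmed = qry_dna[:pos] + qry_dna[pos + n:]
            score = align_score(ref_prot, translate(trimmed))
            if score > best[0]:
                best = (score, n, pos)
    return best
```

For example, with `ref = "ATGAAACCCGGGTTTGCTGAAGATCATTGGTAA"` and a query that has one extra `C` inserted after the ninth base, `scan_for_frameshift(ref, query)` reports that removing one base gives the best score. The reported position is only approximate when the indel sits in a repeat, and every candidate deletion costs a full alignment, so this is slow on long genes.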