Question

Using sequence percent similarity from alignments to compare duplication events

6 months ago

Zeng Hao ▴ 40

Hi everyone,

I am trying to compare sequences pairwise to obtain percent similarity scores. For context, these sequences are non-coding DNA that has been duplicated through tandem duplication events. The hypothesis is that, because these sequences are not under selection pressure, their sequence similarity lets me infer how recent each duplication is relative to the others.

For unrelated sequences, I expect a percent similarity score of ~25% (random matches) and increasingly higher scores for sequences that have duplicated recently.
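
As a quick sanity check on the ~25% baseline, here is a minimal sketch that compares two random DNA sequences position by position, without any alignment. It assumes equal base frequencies and equal sequence lengths; the helper names `ungapped_identity` and `random_dna` are made up for illustration. Real non-coding DNA with skewed base composition, or any alignment that is free to insert gaps, would be expected to score somewhat higher than this.

```python
import random


def ungapped_identity(seq_a, seq_b):
    """Percent of matching positions between two equal-length sequences (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def random_dna(length, rng):
    """Random DNA string with equal A/C/G/T frequencies (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


rng = random.Random(0)  # fixed seed so the run is repeatable
a = random_dna(10_000, rng)
b = random_dna(10_000, rng)
baseline = ungapped_identity(a, b)
print(f"ungapped identity of two random sequences: {baseline:.1f}%")
```

The printed value should land close to 25%. Running the same random pairs through the aligner and scoring scheme used for the real data gives an empirical "unrelated" baseline for that particular setup.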

To do this, I have tried to use both global and local pairwise alignment algorithms (i.e., Needleman-Wunsch and Smith-Waterman). The global alignment algorithm is not appropriate for my dataset as I have sequences of different lengths, which would reduce similarity scores significantly due to end gaps. As for local alignments, they are optimal but produce inflated scores (~40%) even for unrelated sequences. For unrelated sequences of similar lengths, I have achieved the expected similarity scores (25-35%) by maximizing end gap penalties using Needleman-Wunsch.

Question: Is my approach sound and appropriate? Is pairwise alignment the right way to do this comparison, or am I relying on the wrong method entirely? A dot plot is probably closest to providing the information I want, but it does not give a numerical value (percent identity) that I can use directly to compare pairs of sequences.

Thank you very much for reading and please let me know if I can provide any additional details to clarify the question.

gene-duplication sequence-alignment • 383 views

updated 6 months ago by shelkmike ★ 1.2k • written 6 months ago by Zeng Hao ▴ 40

score 2 · Accepted Answer · 2023-10-11

I'm not a specialist in such analyses. However, I think it's probably worth performing a global alignment and then discarding all columns with gaps before you calculate the percent similarity. Removing gaps may help you in two ways:
1) It solves the problem of end gaps.
2) It may make the estimation of the divergence time more accurate because it accounts for different rates of indels and substitutions. For example, imagine a single 200 bp indel and 200 point substitutions. If gap columns are counted as mismatches, the indel and the substitutions contribute equally to the percent similarity. However, I doubt that 200 bp indels occur at the same frequency as 200 separate substitutions.
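
The gap-removal idea can be sketched as follows. This is a minimal illustration, not a tested pipeline: it assumes you already have the two aligned rows from a global alignment as equal-length strings with `-` marking gaps, and the function name `gap_free_identity` is made up for this example.

```python
def gap_free_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    # Keep only columns in which both sequences have a base.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        raise ValueError("no gap-free columns to compare")
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Toy alignment: one 2 bp indel and one substitution.
row_a = "ACGT--ACGT"
row_b = "ACGTTTACCT"

identity = gap_free_identity(row_a, row_b)
print(identity)  # 7 matches over 8 gap-free columns -> 87.5
```

If the two gap columns were counted as mismatches instead, the same pair would score 7/10 = 70%, so a single indel event would weigh as much as two independent substitutions. Note that with very divergent sequences few gap-free columns may remain, which makes the estimate noisy.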