Question

Sequence identity between two protein sequences

2.2 years ago

Gene_MMP8 ▴ 240

I recently came across the term protein sequence identity, which is defined as the number of matching positions between two aligned amino acid sequences divided by the length of the alignment.
Consider a hypothetical situation where I download the amino acid sequence of a particular protein from two different databases. The first database gives a sequence of length X and the second a sequence of length Y (X < Y), yet the identity between the two sequences is 100%. Is this possible, given the definition of identity above?

Identity Protein • 3.6k views

updated 2.2 years ago by Andrzej Zielezinski 11k • written 2.2 years ago by Gene_MMP8 ▴ 240

score 3 · Accepted Answer · 2022-09-12

2.2 years ago

Andrzej Zielezinski 11k

I know it sounds weird, but two sequences of different lengths can have 100% identity. The result depends on whether the alignment is global or local. A global alignment aligns the two sequences across their entire length, from beginning to end. A local alignment finds the region with the highest similarity between the two sequences. So two sequences of different lengths can be 100% identical in a local alignment but not in a global alignment, where the unmatched residues of the longer sequence are paired with gaps and still count toward the alignment length.

For example, compare the global and local alignments of two sequences, MKSTVGHSTR and MKSTVG:

Global alignment           Local alignment

MKSTVGHSTR                 MKSTVG
||||||                     ||||||
MKSTVG----                 MKSTVG

Identity: 60%              Identity: 100%
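The two percentages above can be reproduced with a few lines of Python. This is only a minimal sketch: `percent_identity` is a hypothetical helper (not from any library), and it assumes the two sequences are already aligned, have equal length, and use `-` for gaps.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Identical positions divided by alignment length (gap columns included)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if both residues are equal and not gaps.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    return 100.0 * matches / len(aligned_a)


# Global alignment: 6 matches over 10 columns (4 of them are gap columns).
global_identity = percent_identity("MKSTVGHSTR", "MKSTVG----")
# Local alignment: 6 matches over 6 columns.
local_identity = percent_identity("MKSTVG", "MKSTVG")

print(f"Global: {global_identity:.0f}%  Local: {local_identity:.0f}%")
# prints "Global: 60%  Local: 100%"
```

The only difference between the two calls is the denominator: the global alignment keeps the four gap columns, the local one does not.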

2.2 years ago by Andrzej Zielezinski 11k

Thanks for your prompt response. What if the formula used to calculate identity for the global alignment is slightly different (gap-excluded identity)? If we take identity as (number of matches) / min(length(A), length(B)), then the identity is 100% in the first case as well. Is that a possibility?
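To make the proposed formula concrete, here is a rough sketch of it in Python. `identity_over_shorter` is a hypothetical name, and the code assumes pre-aligned, equal-length input with `-` as the gap character; sequence lengths are counted without gaps.

```python
def identity_over_shorter(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Matches divided by the ungapped length of the shorter sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    # Sequence lengths exclude gap characters.
    length_a = len(aligned_a) - aligned_a.count(gap)
    length_b = len(aligned_b) - aligned_b.count(gap)
    shorter = min(length_a, length_b)
    if shorter == 0:
        raise ValueError("one of the sequences is empty")
    return 100.0 * matches / shorter


# The global alignment from the answer: 6 matches / min(10, 6).
full_vs_prefix = identity_over_shorter("MKSTVGHSTR", "MKSTVG----")
# A two-residue fragment aligned inside the longer sequence: 2 matches / min(10, 2).
full_vs_fragment = identity_over_shorter("MKSTVGHSTR", "----VG----")

print(full_vs_prefix, full_vs_fragment)
# prints "100.0 100.0"
```

Under this formula even a very short fragment reaches 100%, so the value says nothing about how much of the longer sequence is covered.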

2.2 years ago by Gene_MMP8 ▴ 240

This formula is a modified version of the Jaccard index, and it is common in alignment-free sequence comparisons (e.g., it is implemented in Mash or Kmer-db). For sequence alignment, however, sequence identity is, as you said, the ratio of identical matches between two sequences to the total length of the alignment (including gaps). That doesn't mean you can't use your own formula: some specific applications do need an additional measure of similarity. In addition to sequence identity, you may also want to look at the alignment score and query coverage. The combination of these three measures (sequence identity, score, and query coverage) is usually enough to characterize the level of similarity between two sequences.
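As a rough illustration of the alignment-free view, the sketch below compares the plain Jaccard index with the min-normalized variant on k-mer sets. It is not how Mash or Kmer-db work internally; the function names and the choice of k = 3 are assumptions made for this example only.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared k-mers divided by the size of the union."""
    if not a or not b:
        raise ValueError("both k-mer sets must be non-empty")
    return len(a & b) / len(a | b)


def min_normalized(a: set[str], b: set[str]) -> float:
    """Shared k-mers divided by the size of the smaller set."""
    if not a or not b:
        raise ValueError("both k-mer sets must be non-empty")
    return len(a & b) / min(len(a), len(b))


long_kmers = kmers("MKSTVGHSTR", 3)   # 8 distinct 3-mers
short_kmers = kmers("MKSTVG", 3)      # 4 distinct 3-mers, all present in the longer set

print(jaccard(long_kmers, short_kmers))         # prints 0.5
print(min_normalized(long_kmers, short_kmers))  # prints 1.0
```

As with the alignment-based version, normalizing by the smaller set gives the maximum value whenever the shorter sequence is fully contained in the longer one.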

2.2 years ago by Andrzej Zielezinski 11k