I recently came across the term protein sequence identity, which is defined as the ratio of the number of matches between two amino acid sequences to the length of their alignment.
I was thinking of a hypothetical situation where I download the amino acid sequence of a particular protein from two different databases. The first database gives the length as X and the second database as Y (X<Y). But the identity between the two sequences is 100%. Is this possible given the way we calculate identity as described above?
Thanks for your prompt response. What if the formula we use to calculate identity is slightly different for the global alignment (gap-excluded identity)? If we take identity as (number of matches)/min(length(A), length(B)), then the identity for the first case is 100% as well. Is that a possibility?
This formula is a modified version of the Jaccard index (in set terms it has the form of the overlap coefficient, shared items divided by the size of the smaller set), and it is common in alignment-free sequence comparisons (e.g., this formula is implemented in Mash or Kmer-db). However, for sequence alignment, sequence identity is, as you said, the ratio of identical matches between two sequences to the total length of the alignment (including gaps). That doesn't mean you can't use your own formula: in some specific applications it is necessary to introduce an additional measure of similarity, as long as you state which definition you used. In addition to sequence identity, you may also want to look at the alignment score and query coverage. The combination of these three measures (sequence identity, score, and query coverage) is usually enough to characterize the level of similarity between two sequences.
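To make the difference between the two definitions concrete, here is a minimal Python sketch. It assumes you already have a gapped pairwise alignment as two equal-length strings with "-" as the gap character. The function name `identity_stats` and the example sequences are made up for illustration and do not come from any particular tool.

```python
def identity_stats(aligned_a: str, aligned_b: str, gap: str = "-") -> dict:
    """Compute identity under two definitions from a gapped pairwise alignment.

    Both inputs must be rows of the same alignment, so they have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # A match is a column where both residues are identical and not a gap.
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )

    # Alignment length counts every column, including gap columns.
    alignment_length = len(aligned_a)

    # Ungapped lengths of the original sequences (X and Y in the question).
    len_a = len(aligned_a.replace(gap, ""))
    len_b = len(aligned_b.replace(gap, ""))

    return {
        "matches": matches,
        # Standard definition: matches / alignment length (gaps included).
        "identity_alignment": matches / alignment_length,
        # Alternative from the comment: matches / length of the shorter sequence.
        "identity_shorter": matches / min(len_a, len_b),
    }


# Hypothetical case: the shorter entry (X = 8) is an exact prefix of the
# longer one (Y = 10), so the global alignment ends in two gap columns.
shorter = "MKTAYIAK--"
longer = "MKTAYIAKQR"
stats = identity_stats(shorter, longer)
print(stats)
```

For this made-up pair the standard definition gives 8/10 = 80%, while dividing by the shorter length gives 8/8 = 100%. That is why it matters to report which denominator was used alongside the identity value.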