Sequence identity between sequences with different lengths
2.2 years ago

A simple question: what is the sequence identity between two sequences when one is much longer than the other?

seq1:  -------------------AGTGTGAAAAAGGT----------------
seq2:  ATATATGCGCATGGTAATAAGTGTGAAAAAGGTTATATGCGCATAAGGT


The shorter sequence matches a stretch of the longer one exactly. Do they have 100% identity? Or rather something like 30%, since seq1 covers about 30% of seq2?

The reason I ask is that I am filtering an alignment of two assemblies of the same genome (with nucmer/mumer), and I can filter out aligned contigs based on identity.

I would say that if you look at seq1, it has 100% identity over 100% of its length; if you look at seq2, it has 100% identity over 30% of its length. It's a matter of point of view.

I would say seq1 is 100% identical to seq2, while seq2 is only 30% identical to seq1.

Unfortunately, this depends heavily on how you look at it.

Great, that's it, thanks! It depends on which sequence is the query and which is the reference. (If you write it as an answer instead of a comment, I'll accept it.)

It also depends on whether you use global or local alignment.
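To make the points in this thread concrete, here is a minimal Python sketch that counts the matching columns in the alignment from the question and divides them by four different denominators. The function name `identity_report` and its dictionary keys are made up for illustration, and this is not how nucmer computes its identity value; it only shows how the choice of denominator changes the number.

```python
def identity_report(aln1: str, aln2: str, gap: str = "-") -> dict:
    """Percent identity of two gapped, equal-length alignment rows,
    reported against several possible denominators (illustrative only)."""
    if len(aln1) != len(aln2):
        raise ValueError("aligned rows must have the same length")

    matches = 0          # columns where both rows carry the same base
    aligned_columns = 0  # columns where neither row has a gap
    for a, b in zip(aln1, aln2):
        if a == gap or b == gap:
            continue
        aligned_columns += 1
        if a.upper() == b.upper():
            matches += 1

    def pct(numerator: int, denominator: int) -> float:
        # Avoid division by zero for empty or all-gap rows.
        return 100.0 * numerator / denominator if denominator else 0.0

    len1 = len(aln1) - aln1.count(gap)  # ungapped length of seq1
    len2 = len(aln2) - aln2.count(gap)  # ungapped length of seq2
    return {
        "over_seq1": pct(matches, len1),                    # seq1 as query
        "over_seq2": pct(matches, len2),                    # seq2 as query
        "over_aligned_columns": pct(matches, aligned_columns),  # local-style
        "over_all_columns": pct(matches, len(aln1)),        # global-style, gaps count
    }


# The alignment from the question, written out with its gap columns.
seq1 = "-" * 19 + "AGTGTGAAAAAGGT" + "-" * 16
seq2 = "ATATATGCGCATGGTAATAAGTGTGAAAAAGGTTATATGCGCATAAGGT"

report = identity_report(seq1, seq2)
for name, value in report.items():
    print(f"{name}: {value:.1f}%")
```

With this alignment, 14 columns match. Dividing by the length of seq1 (14) or by the gap-free columns gives 100%, while dividing by the length of seq2 (49) or by all alignment columns gives about 28.6%, which is the "30%" discussed above.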