I have been reading an article titled "Characterizing SARS-CoV-2 mutations in the United States" and am confused about its use of the Jaccard distance. The article uses the Jaccard distance to compare the SNP variant profiles of SARS-CoV-2 genomes. The Jaccard similarity coefficient of two sets A and B is defined as the size of their intersection divided by the size of their union. The Jaccard distance of A and B is then one minus the Jaccard similarity coefficient, and it is a metric on the collection of all finite sets. This is easy to understand, since distance is the complement of similarity. But doesn't defining distance and similarity this way ignore the order information underlying the sequence structure? Is that sufficient? In other words, is this a good definition of distance / similarity?
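To make the definition concrete, here is a minimal Python sketch of the two quantities as described above. The function names and the SNP labels are illustrative assumptions, not taken from the article, and treating two empty sets as identical is a convention chosen here only to avoid dividing by zero.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty profiles count as identical.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """One minus the Jaccard similarity coefficient."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical SNP profiles, one set of variant labels per genome.
genome_1 = {"C241T", "C3037T", "A23403G"}
genome_2 = {"C241T", "C3037T", "G28881A"}

print(jaccard_similarity(genome_1, genome_2))  # 2 shared / 4 distinct = 0.5
print(jaccard_distance(genome_1, genome_2))    # 0.5

# Listing the same variants in a different order gives distance 0,
# which is the order-insensitivity asked about here.
print(jaccard_distance(["C241T", "C3037T"], ["C3037T", "C241T"]))  # 0.0
```

Because the inputs are converted to sets, the order in which variants are listed and any repeated entries have no effect on the result.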

written 11 days ago by 2001linana

Similarity measurements are generally useful for painting a quick picture of how divergent the sites in a set of sequences are. You are correct that they do not capture all of the useful information about a protein structure, but whether this really presents a problem depends on the question you're asking of the data.
