I've recently been reading an article titled "Characterizing SARS-CoV-2 mutations in the United States" and got confused about its use of the Jaccard distance. The article uses the Jaccard distance to compare the SNP variant profiles of SARS-CoV-2 genomes. First, the Jaccard similarity coefficient is defined as the size of the intersection divided by the size of the union of two sets A and B. The Jaccard distance between A and B is then one minus the Jaccard similarity coefficient, and it is a metric on the collection of all finite sets. This is easy to understand, as distance complements similarity. But wouldn't defining distance and similarity this way ignore the order information underlying the sequence structure? Would this be sufficient? I mean, is this a good distance/similarity definition then?
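For reference, the two definitions above can be written compactly (the symbols J and d_J are just notation chosen here, not necessarily the article's):

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|},
\qquad
d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}
```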
I think it is a reasonable metric. Just to be sure the details are clear: imagine you want to see how distant corona strain A is from corona strain B. The first thing you do is sequence both strains, align them to the reference corona sequence and perform variant calling. Most of the sequence is identical (hence, uninformative); only at certain sites do you observe variants in strain A with respect to the reference and variants in strain B with respect to the reference. This can be represented as two sets: set A contains the variants detected in strain A, while set B contains the variants detected in strain B. The number of variants in each set is a measure of divergence with respect to the reference (how distant a strain is from the reference). But the question is not that; it is rather how distant set A is from set B. Hence, you can employ the Jaccard distance to measure the distance between set A and set B: a large intersection between sets A and B serves as good evidence of similarity between strains A and B. This strategy makes sense if the sequences are relatively similar to each other, i.e. there are not too many variants (as is the case for viral sequences belonging to the same species). It is a bit like measuring the length of a phylogenetic branch from A to B passing through the reference.
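To make the set picture concrete, here is a minimal Python sketch. The variant tuples, the `(position, ref, alt)` encoding and the function name `jaccard_distance` are my own illustrative assumptions, not taken from the article; in practice the sets would come from your variant-calling output.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        # Two strains with no variants are both identical to the reference.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Illustrative variant calls against the same reference: (position, ref, alt).
# These are made-up inputs for the example, not results from the article.
strain_a = {(241, "C", "T"), (3037, "C", "T"), (14408, "C", "T"), (23403, "A", "G")}
strain_b = {(241, "C", "T"), (3037, "C", "T"), (23403, "A", "G"), (28881, "G", "A")}

# 3 shared variants out of 5 distinct variants in total
distance = jaccard_distance(strain_a, strain_b)
print(distance)
```

Here the two strains share 3 of 5 distinct variants, so the similarity is 3/5 and the distance is 2/5. Note that each set element carries its genomic position, so positional information is part of what gets compared even though the set itself is unordered.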
Does this make sense to you? :)
I think @Ventrilocus provided a good argument, but I will add some caveats that may not have been covered.
If memory serves well, Jaccard similarity is normally used to measure the overlap between datasets of the same size. It is used, for example, in comparing a known clustering solution with clusters obtained by various methods. Setting aside for a moment the requirement for the same size, the explanation by @Ventrilocus holds in that Jaccard similarity between the two related strains is most likely very close to 1 (so the distance is close to 0) - if we choose to look at the whole genome. That said, the argument above about the subsets is valid only if the subsets somehow include all present and future mutants we'd be interested in studying, or if we are content to look only at mutations in the originally chosen subset. To me this is a potentially serious drawback.
Finally, the main reason I think Jaccard similarity is not a good measure is that it doesn't account for the effect of silent mutations. Changing CGT in the reference into CGC in strain A will lower the Jaccard similarity, even though both codons code for arginine. In that sense, the Jaccard similarity/distance may be good at capturing absolute mutation rates, but it is deficient when it comes to the biological consequences of those mutations.
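As a rough illustration of the silent-mutation point, the sketch below compares the Jaccard distance before and after dropping synonymous codon changes. Everything here is an assumption made for the example: the codon-level `(codon_index, reference_codon, observed_codon)` encoding, the helper `drop_synonymous`, the made-up variants, and the deliberately partial codon table (only a few entries of the standard genetic code).

```python
# Partial codon table (standard genetic code), just enough for this example.
CODON_TO_AA = {
    "CGT": "Arg", "CGC": "Arg", "CGA": "Arg", "CGG": "Arg",
    "CAT": "His",
}


def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def drop_synonymous(variants: set) -> set:
    """Keep only codon changes that alter the encoded amino acid."""
    return {
        (idx, ref, alt)
        for idx, ref, alt in variants
        if CODON_TO_AA[ref] != CODON_TO_AA[alt]
    }


# Hypothetical codon-level variants: (codon_index, reference_codon, observed_codon).
strain_a = {(10, "CGT", "CGC"), (25, "CGT", "CAT")}  # one silent change, one Arg -> His
strain_b = {(25, "CGT", "CAT")}                      # only the Arg -> His change

raw = jaccard_distance(strain_a, strain_b)
filtered = jaccard_distance(drop_synonymous(strain_a), drop_synonymous(strain_b))
print(raw, filtered)
```

On the raw sets the strains share 1 of 2 variants, so the distance is 0.5; once the silent CGT to CGC change is filtered out, both strains carry the same single amino-acid change and the distance drops to 0. Whether filtering like this is appropriate depends on whether you care about mutation counts or about protein-level consequences.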