Question

How to understand this "Jaccard distance"?

1

3.5 years ago

2001linana ▴ 40

I've been reading an article titled "Characterizing SARS-CoV-2 mutations in the United States" and got confused about the Jaccard distance. The article uses the Jaccard distance to measure the similarity between SNP variants and to compare the SNP variant profiles of SARS-CoV-2 genomes. First, the Jaccard similarity coefficient is defined as the size of the intersection divided by the size of the union of two sets A and B. The Jaccard distance of A and B is then one minus the Jaccard similarity coefficient, and it is a metric on the collection of all finite sets. This is easy to understand, as distance complements similarity. But defining distance and similarity this way would ignore the order information underlying the sequence structure, right? Would this be sufficient? I mean, is this a good distance/similarity definition?
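To make the definition above concrete, here is a minimal Python sketch of the Jaccard similarity and distance on plain sets. The function names and the variant labels are illustrative assumptions, not taken from the article, and returning a similarity of 1 for two empty sets is a convention chosen here.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """One minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical variant labels, for illustration only.
strain_a = {"C241T", "C3037T", "A23403G"}
strain_b = {"C241T", "A23403G", "G25563T", "C1059T"}

# 2 shared variants out of 5 distinct variants in total.
print(jaccard_similarity(strain_a, strain_b))

# Sets carry no order: the same elements listed in a different order
# are at distance 0.
print(jaccard_distance(["x", "y"], ["y", "x"]))
```

Note that when each element is a variant label that includes its genomic position, some positional information is kept inside the elements themselves, even though the sets are unordered.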

sequencing sequence • 3.1k views

updated 2.6 years ago by Mensur Dlakic ★ 27k • written 3.5 years ago by 2001linana ▴ 40

2

Similarity measurements are generally useful for painting a quick picture of how divergent a site in a set of sequences is. You are correct that they do not capture all of the useful information about a protein structure, but whether this really presents a problem depends on the question you're asking of the data.

3.5 years ago by Joe 21k

score 2 · Answer 1 · 2020-11-20

I think it is a reasonable metric. Just to be sure the details are clear: imagine you want to see how distant corona strain A is from corona strain B. The first thing you do is sequence both strains, align them to the reference corona sequence and perform variant calling. Most of the sequence is identical (hence, uninformative); only at certain sites do you observe variants in strain A with respect to the reference and variants in strain B with respect to the reference. This can be represented as two sets: set A contains the variants detected in strain A, while set B contains the variants detected in strain B. The number of variants in each set is a measure of divergence with respect to the reference (how distant a strain is from the reference). But the question is not that; it is rather how distant set A is from set B. Hence, you can employ the Jaccard distance to measure the distance between set A and set B: a large intersection between sets A and B serves as good evidence of similarity between strains A and B. This strategy makes sense if the sequences are relatively similar to each other, i.e. there are not too many variants (as is the case for viral sequences belonging to the same species). It is a bit like measuring the length of a phylogenetic branch from A to B passing through the reference.

Does this make sense to you? :)
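The procedure described in this answer can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: the sequences are made up, they are already aligned to the reference and of equal length, and only substitutions are considered. The helper name `call_snps` is hypothetical and stands in for a real variant-calling step.

```python
def call_snps(reference, sample):
    """Return substitutions as (1-based position, ref base, alt base) tuples.

    Assumes `sample` is already aligned to `reference`, with no indels.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return {
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    }


def jaccard_distance(a, b):
    """One minus intersection size over union size (0.0 if both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Made-up toy sequences, not real viral data.
reference = "ACGTACGTAC"
strain_a = "ACTTACGCAC"  # differs from the reference at positions 3 and 8
strain_b = "ACTTAAGTAC"  # differs from the reference at positions 3 and 6

set_a = call_snps(reference, strain_a)
set_b = call_snps(reference, strain_b)

# Divergence of each strain from the reference: number of variants.
print(len(set_a), len(set_b))

# Distance between the strains: 1 shared variant out of 3 distinct ones.
print(jaccard_distance(set_a, set_b))
```

The identical sites never enter either set, which is why only the variant positions contribute to the distance.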

score 1 · Answer 2 · 2020-11-20

1

3.5 years ago

Mensur Dlakic ★ 27k

I think @Ventrilocus provided a good argument, but I will add some caveats that may not have been covered.

If memory serves well, Jaccard similarity is normally used to measure the overlap between datasets of the same size. It is used, for example, in comparing a known clustering solution with clusters obtained by various methods. Setting aside for a moment the requirement for the same size, the explanation by @Ventrilocus holds in that Jaccard similarity between the two related strains is most likely very close to 1 (so the distance is close to 0) - if we choose to look at the whole genome. That said, the argument above about the subsets is valid only if the subsets somehow include all present and future mutants we'd be interested in studying, or if we are content to look only at mutations in the originally chosen subset. To me this is a potentially serious drawback.

Finally, the main reason I think Jaccard similarity is not a good measure is that it doesn't account for the effect of silent mutations. Changing CGT from the reference into CGC in strain A will lower the Jaccard similarity, even though both of them code for arginine. In that sense, the Jaccard similarity/distance measure may be good at accounting for absolute mutation rates, but it would be deficient when it comes to the biological consequences of those mutations.
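The silent-mutation point can be illustrated with a toy comparison of nucleotide-level and amino-acid-level variant sets. Everything below is a sketch under stated assumptions: the six-base coding sequences are made up, the codon table is a deliberately partial one covering only the four codons used, and the helper names are hypothetical.

```python
# Partial codon table, just enough for this toy example.
CODON_TABLE = {"CGT": "R", "CGC": "R", "GAT": "D", "GGT": "G"}


def translate(seq):
    """Translate an in-frame coding sequence using the partial table above."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def substitutions(reference, sample):
    """Positions where two equal-length strings differ, as (pos, ref, alt)."""
    return {
        (pos, r, s)
        for pos, (r, s) in enumerate(zip(reference, sample), start=1)
        if r != s
    }


def jaccard_similarity(a, b):
    """Intersection size over union size (1.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


reference = "CGTGAT"  # Arg, Asp
strain_a = "CGCGGT"   # silent CGT>CGC, plus GAT>GGT (Asp to Gly)
strain_b = "CGTGGT"   # only GAT>GGT (Asp to Gly)

# Nucleotide level: the silent change in strain A lowers the similarity.
nt_a = substitutions(reference, strain_a)
nt_b = substitutions(reference, strain_b)
print(jaccard_similarity(nt_a, nt_b))

# Amino-acid level: both strains carry the same single change.
aa_a = substitutions(translate(reference), translate(strain_a))
aa_b = substitutions(translate(reference), translate(strain_b))
print(jaccard_similarity(aa_a, aa_b))
```

In this toy case the two strains look only half similar at the nucleotide level but identical at the protein level, which is the gap the paragraph above describes.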

3.5 years ago by Mensur Dlakic ★ 27k

1

1: Why do we have the "same size datasets" requirement? I did not see it in the formula definition. 2: So we can choose what we'd like the subsets to be in the definition? What do you mean by "the argument above about the subsets is valid only if the subsets somehow include all present and future mutants we'd be interested in studying..."?

3.5 years ago by 2001linana ▴ 40

0

To echo 2001linana's question below, where is this requirement for the same size dataset coming from?

2.6 years ago by lauren • 0

0

I know for sure that the Jaccard function in sklearn requires that the sets for comparison be of equal size. I don't know offhand the mathematical reason for that, but intuitively I think they have to be of equal size, or else one could always find many different unions between a smaller vector and a particular fraction of a larger vector. How would we decide which of those unions to report, especially since many of them would likely have the same Jaccard score?
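One possible reading of the equal-size question, offered as an assumption rather than a statement about any library's internals: array-based implementations typically compare two label or indicator vectors of equal length, while the sets those vectors encode can have different sizes. The pure-Python sketch below (hypothetical helper names, made-up variant labels) encodes two sets of different sizes as equal-length 0/1 vectors over a shared universe of variants and recovers the same score as the set formula.

```python
def to_indicator(variants, universe):
    """Encode a set as a 0/1 vector over an ordered universe of items."""
    return [1 if item in variants else 0 for item in universe]


def jaccard_from_indicators(x, y):
    """Jaccard similarity of two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("indicator vectors must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Assumed convention: two all-zero vectors count as identical.
    return both / either if either else 1.0


# Two sets of different sizes (3 elements versus 1 element).
set_a = {"v1", "v2", "v3"}
set_b = {"v2"}

# Shared universe: every variant seen in either set, in a fixed order.
universe = sorted(set_a | set_b)
vec_a = to_indicator(set_a, universe)
vec_b = to_indicator(set_b, universe)

print(vec_a, vec_b)  # equal-length vectors
print(jaccard_from_indicators(vec_a, vec_b))  # same as |A ∩ B| / |A ∪ B|
```

Under this encoding the vectors must match in length, but nothing forces the underlying sets to contain the same number of variants.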

2.6 years ago by Mensur Dlakic ★ 27k