Question

What is a suitable metric to compute a cell-to-cell distance matrix?

2.2 years ago

Gabriel ▴ 150

I have noticed that Seurat and scran use KNN or SNN graphs to find each cell's nearest neighbours for clustering, data integration, etc. (for example via makeSNNGraph).

I have seen mentions of Euclidean, Jaccard, and rank-based measures, the latter applied to the ranks of the nearest neighbours, but how are the ranks themselves calculated, and which distance metrics are better suited to single-cell data?

I have seen distances calculated from the reduced dimensions, simply by applying some distance metric to the PCA scores. For example:

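library(SingleCellExperiment)
# sce: an existing SingleCellExperiment with a "PCA" reduced dimension
mat <- reducedDim(sce, "PCA")
distance <- dist(mat, method = "euclidean")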

seurat KNN scran SNN adjacency

score 2 · Answer 1 · 2022-02-01

I don't think there's a single right answer. A distance measure implies a certain notion of similarity between the cells, and which notion of similarity is relevant varies with the context.

In addition, some measures have properties that make them more or less suitable in certain contexts. For example, many distance measures suffer from the concentration phenomenon: in noisy, high-dimensional spaces, the distances between points tend towards a constant, and the remaining differences are essentially random. This can render nearest neighbours in such spaces meaningless, and it is one of the reasons for applying dimensionality reduction methods. However, some dimensionality reduction methods produce an embedding in which distances no longer preserve the same notion of similarity as in the original space.

Choosing a distance measure is also subject to some technical considerations, such as whether the data are discrete, normalized, skewed, or contain outliers.

To judge whether a distance measure is suitable you need external information: do cells already known to be similar (for whatever notion of similarity you are interested in) have a short distance between them? Otherwise, how would you evaluate the outcome if two different measures gave you two different clusterings? On the other hand, when there is strong structure in the data, multiple, if not most, approaches should find it.
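To make the concentration phenomenon mentioned above a bit more concrete, here is a minimal sketch in base R. It uses simulated Gaussian noise rather than real expression data (that is an assumption made for illustration), and `relative_contrast` is a hypothetical helper name, not a function from Seurat or scran. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance, as the number of dimensions grows.

```r
set.seed(1)

# Relative contrast of Euclidean distances from the first point to all others:
# (max - min) / min. Values close to 0 mean that the nearest and the farthest
# neighbour are almost equally far away.
relative_contrast <- function(n_points, n_dims) {
  # Pure noise: every coordinate is an independent standard normal draw.
  x <- matrix(rnorm(n_points * n_dims), nrow = n_points)
  # Distances from point 1 to every other point.
  d <- as.matrix(dist(x, method = "euclidean"))[1, -1]
  (max(d) - min(d)) / min(d)
}

dims <- c(2, 10, 100, 1000)
contrast <- sapply(dims, function(p) relative_contrast(200, p))

print(data.frame(dims = dims, contrast = round(contrast, 2)))
```

In this kind of simulation the contrast should shrink as the dimension increases, which is why nearest neighbours computed directly on thousands of noisy genes can be hard to interpret, whereas distances on a few dozen principal components tend to behave better. Real data are not pure noise, so treat this only as an illustration of the effect, not as a statement about any particular dataset.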