Question

UMAP and "equal" objects

0

2.8 years ago

German.M.Demidov ★ 2.9k

I want to plot a very large dataset. UMAP works quite well with this type of data (not single-cell expression, but similar). However, I have a couple of clusters of completely identical objects (the distance between all objects within each cluster is 0), and UMAP somehow draws these huge clusters as "outlier" dots, even though these objects are not that dissimilar to the other objects.

I can replace these objects with a single representative, but is there an alternative way to visualize such clusters with UMAP so that they are not plotted as a dot very far from the other dots?
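The single-representative option mentioned above can be sketched as follows. This is a minimal illustration using only the standard library; `collapse_duplicates` and `expand_embedding` are hypothetical helper names, and the toy data stands in for the real matrix. The idea is to embed only the representatives and then copy each representative's coordinates back to all of its duplicates.

```python
def collapse_duplicates(rows):
    """Collapse identical rows into single representatives.

    Returns (representatives, inverse, counts) such that
    representatives[inverse[i]] has the same values as rows[i].
    """
    index_of = {}          # row contents -> index into representatives
    representatives = []
    inverse = []
    counts = []
    for row in rows:
        key = tuple(row)
        idx = index_of.get(key)
        if idx is None:
            idx = len(representatives)
            index_of[key] = idx
            representatives.append(list(row))
            counts.append(0)
        inverse.append(idx)
        counts[idx] += 1
    return representatives, inverse, counts


def expand_embedding(embedding, inverse):
    """Give every original row the coordinates of its representative."""
    return [embedding[i] for i in inverse]


# Toy data: the first row appears three times.
data = [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [1.0, 2.0], [5.0, 6.0]]
reps, inverse, counts = collapse_duplicates(data)

# In practice `reps` would be passed to the embedding method. Here the
# representatives themselves act as a stand-in "embedding" to show the
# mapping back to the original rows.
full = expand_embedding(reps, inverse)
```

The `counts` list can be used to scale marker size or color in the plot, so the size of each collapsed cluster stays visible. The umap-learn documentation also describes a `unique` constructor option that deduplicates rows internally; whether it is available depends on the installed version.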

visualization • 1.4k views

updated 2.8 years ago by James • 0 • written 2.8 years ago by German.M.Demidov ★ 2.9k

score 1 · Answer 1 · 2021-06-17

1

2.8 years ago

Jean-Karim Heriche 27k

You probably need to play with the parameters. Check these papers to get an idea of where you could focus your efforts:

A Unifying Perspective on Neighbor Embeddings along the Attraction-Repulsion Spectrum
Initialization is critical for preserving global data structure in both t-SNE and UMAP
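As a starting point for playing with the parameters, here is a rough sketch assuming the umap-learn package. The argument names (`n_neighbors`, `min_dist`, `repulsion_strength`, `init`) are `umap.UMAP` constructor arguments, but the specific values and the helper `embed_with` are illustrative assumptions, not recommendations from the papers.

```python
# Candidate settings to compare side by side; values are illustrative, not tuned.
candidate_settings = [
    {"n_neighbors": 15, "min_dist": 0.1},                 # umap-learn defaults
    {"n_neighbors": 100, "min_dist": 0.1},                # larger neighbourhoods
    {"n_neighbors": 100, "min_dist": 0.5,
     "repulsion_strength": 0.5},                          # weaker repulsion
    {"n_neighbors": 100, "min_dist": 0.1,
     "init": "random"},                                   # different initialization
]


def embed_with(settings, X, seed=0):
    """Run UMAP on X with one settings dict (requires umap-learn)."""
    import umap  # imported lazily so the settings can be inspected without it

    reducer = umap.UMAP(random_state=seed, **settings)
    return reducer.fit_transform(X)
```

Plotting the embeddings from each settings dict next to each other makes it easier to see which parameter, if any, changes where the duplicate clusters end up.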

score 1 · Answer 2 · 2021-06-17

1

2.8 years ago

Mensur Dlakic ★ 27k

It depends on your definition of a large dataset. I have used openTSNE with 20-30 CPUs on a 100,000 x 136 dataset, and it does the embedding in ~25 minutes. Even though this implementation of t-SNE is not as fast as UMAP, it is fast enough that using t-SNE should not be a problem even on datasets with a million data points, as long as their second dimension is not in the thousands.

I am curious as to how you define the distance between your vectors to be 0. UMAP is not supposed to separate (near-)identical data points at all, no matter what parameters are used.

0

Thanks! The problem with the nearly identical samples is not that they are not separated, but that they are placed far, far away from the general distribution, even though conceptually they are not very far from it. I am fine with them being placed as one dot, but within the general distribution of the data: they are not actual outliers. Since UMAP looks for local similarities, it seems to prefer to "push" these huge clusters of identical objects as far away as possible...
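One plausible contributor to this behaviour (an assumption, not a diagnosis of this particular dataset) is that when a block of identical points is larger than the neighbourhood size, every point in the block fills its nearest-neighbour list with its own copies, so the block itself contributes no links to the rest of the data. The sketch below checks this with a brute-force nearest-neighbour search on toy data. It is not UMAP's actual graph construction, and the helper names are hypothetical.

```python
import math
import random


def knn_indices(points, k):
    """Brute-force k nearest neighbours of each point (excluding itself)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        result.append([j for _, j in dists[:k]])
    return result


def has_outside_neighbour(neighbours, block):
    """True if any point in `block` lists a neighbour outside `block`.

    Only the neighbour lists of the block's own points are checked;
    links that other points send into the block are not considered.
    """
    block = set(block)
    return any(j not in block for i in block for j in neighbours[i])


random.seed(0)
background = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]
duplicates = [[0.2, 0.2] for _ in range(20)]  # 20 identical points inside the cloud
points = background + duplicates
dup_ids = range(50, 70)

# k smaller than the block: every neighbour of a duplicate is another copy.
isolated_small_k = not has_outside_neighbour(knn_indices(points, 10), dup_ids)
# k larger than the block: the neighbour lists must reach outside it.
isolated_large_k = not has_outside_neighbour(knn_indices(points, 30), dup_ids)
```

If this is what is happening, either collapsing the duplicates or using a neighbourhood size larger than the biggest block of identical points would be worth trying.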

2.8 years ago by German.M.Demidov ★ 2.9k

0

Just for information, openTSNE is from (some of) the same people as the papers I linked to above.

2.8 years ago by Jean-Karim Heriche 27k

score 0 · Answer 3 · 2021-06-21

0

2.8 years ago

James • 0

You might try using the linear-correlation distance instead of the Euclidean distance: the correlation distance effectively mean-centers each vector and scales it to unit length before comparing them.
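To make the suggestion concrete, here is a small standard-library sketch of the correlation distance, defined as 1 minus the Pearson correlation. The function name and toy vectors are illustrative. In umap-learn the same metric can be requested with `metric="correlation"`.

```python
import math


def correlation_distance(x, y):
    """1 - Pearson correlation between two equal-length vectors.

    Undefined (division by zero) if either vector is constant.
    """
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    xc = [a - mx for a in x]          # mean-centered copies
    yc = [b - my for b in y]
    dot = sum(a * b for a, b in zip(xc, yc))
    norm = math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc))
    return 1.0 - dot / norm


x = [1.0, 2.0, 4.0, 3.0]
y = [3.0 * a + 10.0 for a in x]   # same profile, different scale and offset
z = [4.0, 1.0, 2.0, 3.0]          # different profile

d_xy = correlation_distance(x, y)  # close to 0 despite a large Euclidean distance
d_xz = correlation_distance(x, z)
euclid_xy = math.dist(x, y)
```

Note that this metric treats vectors that differ only by scale and offset as identical, so it changes which objects count as being at distance 0 rather than removing exact duplicates. Whether that helps depends on the data.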
