fastest UMAP method
18 months ago
Grey Monroe ▴ 10

I am trying to generate a Uniform Manifold Approximation and Projection (UMAP) for about 300,000 observations of 36 variables. So far I have been using the umap package in R but this seems prohibitively slow (for exploratory analyses and parameter optimisation). Can someone recommend an alternative faster method? Thank you!

UMAP R

I've used umap in R for datasets of >3 million cells. It takes a few hours on a cluster environment.


@KevinBlighe could you give a few notes on how your R setup distributes the compute load over several nodes / cores?

To the best of my knowledge, UMAP still only runs on a single host: https://github.com/lmcinnes/umap/issues/37

There are streaming single-GPU implementations of t-SNE like https://github.com/ZJULearning/AtSNE but even those, when running on recent NVIDIA chips like Titan RTXs, take several days if one has an input matrix with 3 million rows in 50 dimensions.

Any advice you can offer on how you run the reduction of a dataset with 3 million rows in a few hours would be hugely valuable. How many dimensions do your points have?


Hey Douglas, thank you for inquiring. There was nothing overtly special about it... I just installed umap in R on an HPC at King's College London and ran it overnight via a PBS script specifying a considerable amount of RAM and 2 cores. I don't know the system architecture used, but I guess it was CentOS. I think the biggest problem facing some people may be patience...?

I need to explore the other UMAP implementations mentioned in this thread on Biostars, though. It is becoming a limiting step for 'Desktop' analysis workflows that need to use UMAP.


Any advice you can offer on how you run the reduction of a dataset with 3 million rows in a few hours would be hugely valuable.

Since you are asking for any advice, here is one: consider running UMAP under Python. I used to run UMAP very frequently on a dataset of roughly 300 variables x 2 million data points. With 8 CPUs, that takes 3-4 days. Your dataset will presumably take less time, since it is considerably smaller (300,000 x 36 versus roughly 2,000,000 x 300). I am not saying it will run in a few hours as you'd like, but that is not very realistic for a dataset with millions of data points, no matter which embedding method you use and no matter your computational power.


Another thing to try, if speed is your main concern, is LargeVis. That will run on the same 300 x 2 million dataset in 5-6 hours (Linux executable with 8 CPUs). To my eye the UMAP plots are cleaner and more aesthetically pleasing, but LargeVis will be ahead by an order of magnitude in terms of speed.

The algorithm is described here.


By the way, what do you mean exactly by prohibitively slow? Many hours? Days?


Good question. Prohibitive only for exploratory analyses (hours). I'd like to see how the output changes with adjustments to parameters like number of neighbors.


I suspected that our definitions of prohibitively slow may be different. A common set of parameters to test is described here. If you go with 6 n_neighbors values (5, 10, 20, 50, 100, 200) and 4 min_dist values (0.25, 0.5, 0.8, 0.99), that should be doable in under a week.
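The sweep above can be enumerated mechanically. A minimal sketch in Python (the parameter values are the ones suggested above; the actual UMAP call is only indicated in a comment, since it depends on which implementation you use):

```python
from itertools import product

# Parameter grid suggested above.
n_neighbors_values = [5, 10, 20, 50, 100, 200]
min_dist_values = [0.25, 0.5, 0.8, 0.99]

grid = list(product(n_neighbors_values, min_dist_values))
print(len(grid))  # 24 combinations

for n_neighbors, min_dist in grid:
    # In a real run you would call, e.g., with umap-learn:
    #   embedding = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist).fit_transform(X)
    # and save each embedding for visual comparison.
    pass
```

At a few hours per run, the 24 combinations are what makes "under a week" a reasonable estimate.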

Honestly, I don't see a problem unless you are on some kind of a deadline. Even if you want to do 50-100 trials that each take a few hours, what is a couple of weeks to find out what works? Besides, UMAP doesn't do any clustering - it only provides a low-dimensional visualization from which clusters are often derived, automatically or manually. That means that most embeddings will show cluster structure if there is one, so hunting for the set of parameters that gives the absolute best visualization may not be necessary.

I will repeat: t-SNE will do a decent job on this dataset as well, and only perplexity really needs tuning. I'd be surprised if you needed more than 3 different perplexity values (say, 20, 30 and 50), and that will also run in no more than 3-4 days for a dataset of this size.


Thank you for the great suggestions!


Any idea why (where df is a 30,000-row x 36-column data frame)

umap(df, n_components = 2, n_neighbors = 200, min_dist = 0.25)


would produce

Error: vector memory exhausted (limit reached?)


Lower n_neighbors values do not yield this error.


You're running out of memory; lower n_neighbors values require less memory. I personally feel that 200 is far more than necessary. I don't think I've ever seen anyone use more than 50, as you start to lose a lot of finer-grained local structure beyond that point. This is particularly the case if you expect some of your cell types of interest to be quite closely related.
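To see why memory scales with n_neighbors, a back-of-envelope sketch helps. This is an assumption about a generic k-nearest-neighbour graph (index plus distance per entry, 8 bytes each), not the R package's actual memory layout, and implementations allocate further intermediate structures on top of the graph:

```python
# Rough estimate: the k-NN graph stores about n_obs * n_neighbors entries,
# each holding an index and a distance (assumed int64 + float64).
n_obs = 300_000          # size of the full dataset in the original question
bytes_per_entry = 8 + 8  # assumed: one int64 index + one float64 distance

def knn_graph_mb(n_neighbors):
    """Approximate k-NN graph size in megabytes."""
    return n_obs * n_neighbors * bytes_per_entry / 1e6

print(round(knn_graph_mb(15)))   # ~72 MB at the common default of 15 neighbours
print(round(knn_graph_mb(200)))  # ~960 MB at n_neighbors = 200
```

The graph alone grows linearly in n_neighbors, which is consistent with the error disappearing at lower values.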


You can also subset your data for exploratory analysis. I would argue that after 50,000 points, you won't really be able to see any differences anyway. There are only so many dots you can put on a plot before they simply start overlapping the previous dots.
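The subsetting idea is a one-liner in most environments. A hedged Python sketch (the 300,000 and 50,000 figures come from the thread; `data` is a hypothetical matrix and the final indexing line is only indicated in a comment):

```python
import random

# Draw a reproducible random subset of row indices for exploratory runs.
random.seed(0)                         # fix the seed so runs are repeatable
rows = range(300_000)                  # index of a 300,000-observation dataset
subset = random.sample(rows, 50_000)   # 50k points, per the suggestion above

print(len(subset))  # 50000
# The indices would then select rows from the data matrix before running
# UMAP, e.g. data[subset, :] with numpy-style indexing.
```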


This is a valid approach ONLY if the subset is representative of the whole dataset. That is, if it is drawn from the same data distribution. Drawing the subset randomly may or may not fulfill this criterion - that's the nature of random sampling. It comes down to how important it is to do this right vs. time constraints.


To clarify, the suggestion was to experiment with different parameters using a small subset. Then apply the optimal settings to the full dataset. That will answer the question of how representative the subset was.

I personally would be curious to see an example using real biological data with over 100k samples where a 10% random subset generates a substantially different UMAP representation.


If you have a CUDA capable GPU available, you can try gpumap.

18 months ago
igor 12k

If speed is a concern, you can perform PCA prior to UMAP, reducing your input matrix. Many single-cell packages do that by default, so it's a fairly common practice. You can read previous discussion here: Why do PCA+tSNE/viSNE or PCA+UMAP, and not just tSNE/viSNE or UMAP on their own?
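The PCA-then-UMAP pipeline can be sketched in a few lines of Python with numpy. This is a minimal illustration, not a recommendation of specific settings: the 1,000-row matrix stands in for the 300,000 x 36 dataset, the choice of 20 components is an assumption, and the umap-learn call is only indicated in a comment:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 36))   # stand-in for the 300,000 x 36 matrix

# Centre the columns, then project onto the top principal components via SVD.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_components = 20                      # assumed; pick based on explained variance
X_reduced = Xc @ Vt[:n_components].T   # shape (1000, 20)

print(X_reduced.shape)  # (1000, 20)
# X_reduced would then be fed to the embedding, e.g. with umap-learn:
#   import umap; embedding = umap.UMAP().fit_transform(X_reduced)
```

With only 36 input variables the gain here is modest; PCA pre-reduction pays off most when the input has hundreds of dimensions.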

To speed up the PCA computation, you can use the irlba package. The uwot package does that internally (if the pca argument is specified).

18 months ago
Mensur Dlakic ★ 11k

This does not sound right. I use python UMAP very often on a dataset with 1 million data points and about 300 variables. It takes 1-2 hours on a modern multi-core PC.

If this is something that is dataset-dependent and UMAP is truly slow with your setup, I would suggest trying it in Python before doing anything else. Sure, reducing the number of variables by PCA or other means will speed things up, but it may discard important signal if the retained components explain less than 100% of the variance. Also, t-SNE should work fine on a dataset of this size, though it will take half a day or so.

18 months ago

Try uwot; it is implemented in C++ and should be at least somewhat faster.

18 months ago

I am using sample <- RunUMAP(sample, umap.method = "umap-learn"), which may require installing umap-learn and some of its Python dependencies.