I am trying to generate a Uniform Manifold Approximation and Projection (UMAP) embedding for about 300,000 observations of 36 variables. So far I have been using the umap package in R, but it is prohibitively slow for exploratory analyses and parameter optimisation. Can anyone recommend a faster alternative?
This does not sound right. I regularly run the Python UMAP implementation (umap-learn) on a dataset of 1 million data points and about 300 variables, and it takes 1-2 hours on a modern multi-core PC.
If the slowness is dataset-dependent and UMAP is genuinely slow with your setup, I would try the Python implementation before doing anything else. Reducing the number of variables by PCA (or other means) will also speed things up, but whenever the retained components explain less than 100% of the variance, the discarded components may carry important signal. t-SNE should also work fine on a dataset of this size, though it will take half a day or so.
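The PCA preprocessing step above can be sketched as follows, assuming scikit-learn; the 95% variance threshold is purely illustrative, and the random matrix stands in for the real data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 36))  # placeholder for the real 300,000 x 36 matrix

# Keep the smallest number of components explaining >= 95% of the variance.
# Anything below 100% can discard signal, which is the trade-off noted above.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

# X_reduced can then be passed to UMAP (or t-SNE) in place of X.
```

How much this helps depends on how correlated the 36 variables are: strongly correlated variables compress into few components, while near-independent ones do not.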
Hello all. UMAP is slow at millions of data points and beyond, limited by its single-threaded NN-Descent nearest-neighbour search. uwot is faster for the nearest-neighbour search, but its embedding step is still single-threaded. Check out annembed, which Jean and I developed: we parallelize every stage, and it can embed 11 million data points in just 40 minutes on 24 threads. https://github.com/jean-pierreBoth/annembed