Question

Can we use UMAP clustering on bulk data?

0


18 months ago

Info.shi ▴ 30

Hi everyone,

I have bulk transcriptomic data with 10k genes and 5 replicates. I want to see how the replicates cluster, using UMAP to represent each replicate as a point, but the call below fails with "Error: umap: number of neighbors must be smaller than the number of items":

iris.umap <- umap(matrix)

Replicates  MSTRG.714.1   MSTRG.9848.1  MSTRG.8579.1  MSTRG.2154.1  MSTRG.434.1    ................. 
Rep_1       12.1871378    4.648702047   0.125640596   2.512811917   5.905108005    .................
Rep_2       8.69549926    5.864406477   0.101110457   1.213325478   4.246639173    .................
Rep_3       10.3490802    4.704127361   0.188165094   0.376330189   4.327797173    .................
Rep_4       9.803265483   0.710381557   0.284152623   1.420763113   5.967205076    .................
Rep_5       24.94352535   1.950890251   0.139349304   0.975445125   2.508287465    .................

Any suggestions would be appreciated.

R UMAP • 1.9k views

updated 18 months ago by Ram 43k • written 18 months ago by Info.shi ▴ 30

0


If you have a distance matrix of some sort, I recommend that you try affinity propagation.

18 months ago by 5heikki 11k

score 2 · Answer 1 · 2022-09-30

2


18 months ago

dariober 14k

I think you need:

umap(matrix, n_neighbors=n)

with n less than the number of samples (i.e. < 5).
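As a minimal sketch of that call, assuming the R umap package (the one that produces the error in the question): the package default for n_neighbors is 15, which is why a 5-row matrix triggers the error. The matrix `expr` below is made-up stand-in data, not the poster's table, and `init = "random"` is an extra assumption of this sketch so that the run does not rely on spectral initialisation with only five points.

```r
# Hypothetical stand-in data: 5 replicates (rows) x 200 genes (columns).
set.seed(1)
expr <- matrix(rexp(5 * 200), nrow = 5,
               dimnames = list(paste0("Rep_", 1:5), paste0("gene_", 1:200)))

# Largest neighbourhood size that is still below the number of samples.
n <- nrow(expr) - 1

# Only run the embedding if the umap package is installed.
if (requireNamespace("umap", quietly = TRUE)) {
  # Named arguments override the package defaults (n_neighbors = 15).
  # init = "random" is an assumption of this sketch, see the note above.
  fit <- umap::umap(expr, n_neighbors = n, init = "random")
  # fit$layout holds one pair of 2-D coordinates per replicate.
  print(fit$layout)
}
```

With so few points the layout will change noticeably from one seed or n_neighbors value to the next, so treat any apparent grouping with caution.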

I'm not 100% sure about what follows. In principle, I don't think it is wrong to apply UMAP to only a few samples. But I think PCA would be preferable, since the distances between data points in a PCA plot are less distorted than in a UMAP embedding, and with 5 samples there cannot be many clusters to identify anyway. If you have many samples and many clusters, UMAP is likely to separate those clusters better than PCA, but the price is that distances are harder to interpret. Since 5 samples cannot form many clusters, it is better to stay with PCA.
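For the PCA route, a minimal sketch using base R's prcomp could look like the following. The matrix `expr` is again made-up stand-in data with replicates in rows and genes in columns, and the log2(x + 1) transform is an assumption of this sketch rather than something stated in the thread.

```r
# Hypothetical stand-in data: 5 replicates (rows) x 200 genes (columns).
set.seed(1)
expr <- matrix(rexp(5 * 200, rate = 0.2), nrow = 5,
               dimnames = list(paste0("Rep_", 1:5), paste0("gene_", 1:200)))

# Log-transform to damp the influence of a few highly expressed genes
# (an assumption of this sketch).
log_expr <- log2(expr + 1)

# PCA with samples as rows; genes are centred but not scaled.
pca <- prcomp(log_expr, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# One point per replicate on the first two components.
plot(pca$x[, 1:2], pch = 19,
     xlab = sprintf("PC1 (%.0f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.0f%%)", 100 * var_explained[2]))
text(pca$x[, 1:2], labels = rownames(pca$x), pos = 3)
```

Because PCA is a linear projection, the spacing between replicates on this plot can be read against the variance explained by each axis, which is the interpretability advantage mentioned above.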

18 months ago by dariober 14k

1


n less than the number of samples

Technically yes, but in practice it should be well below the number of samples. The number of neighbors sets the size of the local neighborhood, so if it equals the number of samples, you are essentially assuming a single neighborhood. The umap-learn tutorial uses a range from 2 to a quarter of the total sample size as examples of extremely low and extremely high settings.

18 months ago by igor 13k