Clustering on Big data (100 million samples)
3 months ago
Rushikesh ▴ 10

I am working with huge data: 100 files of the same kind of data in a homogeneous format. Each file contains 1,000,000 samples, and each sample is a 1024-bit fingerprint of a compound. I need to cluster the compounds, and for that I have to reduce the dimensionality from 1024 to 100. I can't reduce the dimensions in batches (using UMAP), and I can't cluster in batches either. I also can't concatenate all the files into a single file because of RAM constraints. I am looking for a more time- and memory-efficient approach.

What I planned is to run UMAP.fit() on 10 random files out of the 100 (since all the data is of the same kind), which can use multiple cores. Then, with the fitted UMAP, I will run UMAP.transform() on all 100 files to reduce them to n_components = 100, concatenating the results into a single file (which my system can hold in RAM). Finally, I will cluster that single file with fast_hdbscan (as it works on multiple cores); a rough sketch of what I mean is below. Are there any disadvantages to this approach? Is it even the right way to do it?
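A minimal sketch of the planned pipeline, assuming each file is a NumPy array of shape (1,000,000, 1024) saved as .npy; the file names, per-file row count, and min_cluster_size are placeholders:

```python
import numpy as np
import umap                        # umap-learn
from fast_hdbscan import HDBSCAN   # fast_hdbscan

files = [f"fingerprints_{i:03d}.npy" for i in range(100)]  # placeholder file names
n_per_file = 1_000_000

# 1. Fit UMAP on 10 randomly chosen files (~10M x 1024; this still has to fit in RAM).
rng = np.random.default_rng(42)
fit_files = rng.choice(files, size=10, replace=False)
fit_data = np.vstack([np.load(f) for f in fit_files])
reducer = umap.UMAP(n_components=100, n_jobs=-1).fit(fit_data)
del fit_data

# 2. Transform the files one at a time into a preallocated 100M x 100 float32
#    array (~40 GB), so the full 1024-D data never sits in memory at once.
embedded = np.empty((len(files) * n_per_file, 100), dtype=np.float32)
for i, f in enumerate(files):
    embedded[i * n_per_file:(i + 1) * n_per_file] = reducer.transform(np.load(f))

# 3. Cluster the reduced data with fast_hdbscan, which runs on multiple cores.
labels = HDBSCAN(min_cluster_size=100).fit_predict(embedded)
```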

Hoping for constructive responses.

Clustering • Big Data • Machine Learning • 529 views
3 months ago
Mensur Dlakic ★ 30k

The right approach would be to find a server with plenty of memory and do this all at once. That said, you might be able to get away with it.

What are the disadvantages? When you pick your initial random collection and do the embedding, you might end up with a non-representative group of files. If that's the case, some groups will only pop up later, during the UMAP transform, and their relationships to the other groups may not be mapped properly. If you are lucky and at least one representative of each group is included in the initial set, this could work.
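If loading ten whole files for the fit is feasible, one way to lower that risk is to build the fit set from a random slice of every file instead of ten complete files, so every file (and hopefully every group) contributes points to the embedding. A sketch, reusing the placeholder file names from the question:

```python
import numpy as np
import umap  # umap-learn

files = [f"fingerprints_{i:03d}.npy" for i in range(100)]  # placeholder names
rng = np.random.default_rng(0)

# ~100k random rows from each of the 100 files gives the same ~10M-point
# fit set, but with every file represented.
rows_per_file = 100_000
fit_sample = np.vstack([
    np.load(f)[rng.choice(1_000_000, size=rows_per_file, replace=False)]
    for f in files
])
reducer = umap.UMAP(n_components=100, n_jobs=-1).fit(fit_sample)
```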
