I am working with a huge dataset: 100 files of the same kind of data in a homogeneous format. Each file contains 1,000,000 samples, and each sample is a 1024-dimensional fingerprint of a compound. I need to cluster the compounds, so I have to reduce their dimensionality from 1024 to 100. I can't reduce dimensions in batches (using UMAP) and can't cluster in batches either. I can't even concatenate all the files into a single file because of RAM constraints. I'm looking for a more time- and memory-efficient approach.
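For scale, here is a quick back-of-envelope estimate (assuming 4-byte float32 values everywhere; the raw fingerprints may be stored more compactly on disk):

```python
# Rough memory estimate for the full dataset vs. the reduced embedding.
n_files, samples_per_file, raw_dim, reduced_dim = 100, 1_000_000, 1024, 100
bytes_per_value = 4  # float32 (an assumption; binary fingerprints could be smaller)

raw_gb = n_files * samples_per_file * raw_dim * bytes_per_value / 1e9
reduced_gb = n_files * samples_per_file * reduced_dim * bytes_per_value / 1e9
print(f"all raw fingerprints: ~{raw_gb:.0f} GB")      # ~410 GB -> can't be concatenated
print(f"reduced to 100 dims:  ~{reduced_gb:.0f} GB")  # ~40 GB -> what HDBSCAN must hold
```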
What I planned is to call UMAP.fit() on 10 random files out of the 100 (since all the data is of the same kind), which runs on multiple cores, and then use the fitted UMAP to UMAP.transform() each of the 100 files down to n_components=100, concatenating the results into a single file (which my system can hold in RAM). Finally, I will cluster that single file using fast_HDBSCAN (as it works on multiple cores); a sketch of the pipeline is below. Are there any disadvantages to this approach? Is it even the right way to do it?
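To make the plan concrete, here is a minimal sketch of what I mean. The file names, the .npy format, min_cluster_size, and the jaccard metric are placeholders/assumptions; I'm also assuming fast_hdbscan exposes an HDBSCAN class with the usual fit_predict API.

```python
import numpy as np
import umap                       # umap-learn
from fast_hdbscan import HDBSCAN  # could swap in hdbscan.HDBSCAN instead

rng = np.random.default_rng(42)
all_files = [f"fingerprints_{i:03d}.npy" for i in range(100)]  # placeholder paths

# 1) Fit UMAP on a random subset of 10 files (as much data as fits in RAM).
fit_files = rng.choice(all_files, size=10, replace=False)
fit_data = np.concatenate([np.load(f) for f in fit_files])
reducer = umap.UMAP(n_components=100, metric="jaccard", low_memory=True)
reducer.fit(fit_data)
del fit_data  # free RAM before the transform pass

# 2) Transform every file with the fitted model and stack the reduced parts.
reduced_parts = []
for f in all_files:
    X = np.load(f)
    reduced_parts.append(reducer.transform(X).astype(np.float32))
embedding = np.concatenate(reduced_parts)
del reduced_parts

# 3) Cluster the reduced embedding in one shot.
clusterer = HDBSCAN(min_cluster_size=50)  # min_cluster_size is just an example
labels = clusterer.fit_predict(embedding)
```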
Hoping for constructive responses...