I am very new to genomics, and to DNA sequence clustering specifically. I have around 100 samples, each with millions of DNA reads.
I used a sequence clustering method to cluster the reads of a single sample. However, I think clustering the millions of DNA reads of a single sample on its own does not make much sense, even though the clusters look fine.
I want to find a way to cluster all 100 samples together instead of doing it sample by sample. For that, I need to select some reads from each sample to cluster, not all the billions of reads. In other words, I need a way to intelligently select a subset of the reads of each sample, such that the subset represents the whole sample. Once I have a subset of reads that fairly represents each sample, I will be able to cluster all the samples together.
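To make the subsetting idea concrete, the simplest baseline I can think of is uniform random subsampling of each sample. Below is a minimal sketch using reservoir sampling, which picks k reads uniformly at random in a single pass without loading the whole file into memory. It assumes uncompressed FASTQ input with exactly four lines per record; the function names `read_fastq` and `reservoir_sample` and the file name in the comment are hypothetical, chosen only for illustration. Whether a uniform random subset is "representative" enough for clustering is exactly what I am unsure about.

```python
import random
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: (header, sequence, separator, quality).
FastqRecord = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream.

    Assumption: every record spans exactly four lines (no wrapped sequences).
    """
    it = iter(lines)
    while True:
        header = next(it, None)
        if header is None or not header.strip():
            return  # end of input (or trailing blank line)
        seq = next(it, "")
        sep = next(it, "")
        qual = next(it, "")
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               sep.rstrip("\n"), qual.rstrip("\n"))


def reservoir_sample(records: Iterable[FastqRecord], k: int,
                     seed: int = 0) -> List[FastqRecord]:
    """Return k records chosen uniformly at random in one pass.

    If the stream holds fewer than k records, all of them are returned.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    reservoir: List[FastqRecord] = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)       # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = rec      # replace with probability k / (i + 1)
    return reservoir


# Hypothetical usage on one sample file:
# with open("sample_01.fastq") as fh:
#     subset = reservoir_sample(read_fastq(fh), k=100_000)
```

The same `k` and a recorded seed per sample would give equally sized, reproducible subsets that could then be pooled across the 100 samples.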
Any thoughts on how to tackle this problem?
One solution I thought of is using PCA. Is there any other genomics-specific approach to tackle this problem?
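For context on the PCA idea: PCA needs each sample as a fixed-length numeric vector, which raw reads are not. One possible representation, and this is purely my assumption rather than an established recipe, is a normalised k-mer frequency vector per sample. The sketch below builds such a samples-by-k-mers matrix in plain Python; `kmer_profile` and `profile_matrix` are hypothetical names, and k = 3 is an arbitrary choice. The resulting matrix is what I would then feed into PCA.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Sequence, Tuple


def kmer_profile(reads: Sequence[str], k: int = 3) -> List[float]:
    """Return normalised k-mer frequencies over ACGT for one sample.

    The vector has 4**k entries in lexicographic k-mer order (AAA, AAC, ...).
    k-mers containing other characters (e.g. N) are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts: Counter = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)  # no valid k-mers in this sample
    return [counts[km] / total for km in kmers]


def profile_matrix(samples: Dict[str, Sequence[str]],
                   k: int = 3) -> Tuple[List[str], List[List[float]]]:
    """Build a (sample names, samples x 4**k matrix) pair, rows sorted by name."""
    names = sorted(samples)
    return names, [kmer_profile(samples[name], k) for name in names]


# Hypothetical usage with two tiny samples:
# names, matrix = profile_matrix({"s1": ["ACGTAC"], "s2": ["GGGTTT"]}, k=3)
```

Note that this summarises each sample as a whole rather than selecting individual reads, so it addresses comparing samples, not the read-selection step.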