Question: Clustering DNA sequences (from human genome)
11 months ago by abdul.karim (0) wrote:


I am very new to genomics, and to DNA clustering specifically. I have around 100 samples, each with millions of DNA reads.

I used a sequence clustering method to build clusters for a single sample. But I think clustering the millions of DNA reads of a single sample does not make much sense, even though the clusters look fine.

I want to find a way to cluster all 100 samples together instead of doing it sample by sample. For that, I need to select some reads from each sample rather than clustering all the billions of reads. That is, I need some way to intelligently select a subset of reads from each sample that represents the whole sample. Once I have a subset that fairly represents each sample, I will be able to cluster all the samples together.

Any thoughts on how to tackle this problem?

One solution I thought of is using PCA. Is there any other genomics-specific approach to tackle this problem?

sequencing genome • 314 views
modified 10 months ago • written 11 months ago by abdul.karim (0)

Try to map the reads first to the genome. Then focus on differences (mutations or variants), and cluster based on these differences.

written 11 months ago by Benn (8.0k)

Let's say I map my reads from two samples to a reference genome (with either exact or approximate mapping). Assume that 60% of the reads from one sample (Sample-A) map to the reference while 40% do not. Do you suggest that I take these 40% unmatched reads? Would I then do the same for Sample-B, take only its unmatched reads, and cluster the unmatched reads from both samples? Please correct me if my understanding is wrong.

written 10 months ago by abdul.karim (0)

My suggestion is to map your reads to a reference genome, and then call the variants. Use those variants for further clustering analysis. I am talking about human genome sequencing, but I am not sure if you are talking about that too. Please explain more in your question about what kind of data you are working with, and what your research question is.
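
To make the "cluster on variants" idea concrete, here is a minimal sketch, not a definitive pipeline. It assumes mapping and variant calling are already done and that each sample has been reduced to a vector of alternate-allele counts (0, 1 or 2) over the same ordered list of variant sites. The sample names, the genotype values, and the helper functions `genotype_distance` and `cluster_samples` are all made up for illustration; the distance and the average-linkage clustering are just one simple choice among many.

```python
from itertools import combinations

# Hypothetical genotype calls: alternate-allele counts (0, 1, 2) per sample
# at the same ordered list of variant sites; None would mark a missing call.
genotypes = {
    "sample_A": [0, 1, 2, 0, 1, 0],
    "sample_B": [0, 1, 2, 0, 0, 0],
    "sample_C": [2, 0, 0, 1, 2, 2],
    "sample_D": [2, 0, 0, 1, 2, 1],
}


def genotype_distance(a, b):
    """Mean absolute difference in alt-allele count over sites called in both."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        raise ValueError("no shared called sites")
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


def cluster_samples(genotypes, n_clusters):
    """Average-linkage agglomerative clustering down to n_clusters groups."""
    clusters = [[name] for name in sorted(genotypes)]
    # Pairwise sample distances, computed once and looked up by sample pair.
    dist = {
        frozenset((a, b)): genotype_distance(genotypes[a], genotypes[b])
        for a, b in combinations(genotypes, 2)
    }

    def linkage(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)


clusters = cluster_samples(genotypes, n_clusters=2)
print(clusters)
```

The point of the sketch is that the unit being clustered is the sample (one genotype vector per individual), not the individual read.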

modified 10 months ago • written 10 months ago by Benn (8.0k)

Yes, I am working with the human genome (as a reference). I have DNA reads from, let's say, two samples (two different individuals), and my goal is to cluster the DNA reads from those two samples. Ideally, I should get two clusters using k-means or any other algorithm. The first challenge is that reads are mostly (99%) similar across all human beings. The second challenge is that the number of reads for each individual is huge.

written 10 months ago by abdul.karim (0)

It doesn't make sense to cluster your raw reads. Why don't you want to focus on the variants, as I suggested (and as the rest of the world does)? Please explain why.

written 10 months ago by Benn (8.0k)

If you did whole genome sequencing, I think the approach that Benn is suggesting is the best way: map the reads against a reference, call the SNPs, and compare the SNPs. I am not sure, but I don't think making a subsample is needed (or recommended?) in this case.

If you did amplicon sequencing, i.e. only sequenced the genes of interest with specific primers, maybe you can check out this page: This method can also be done with VSEARCH. It was originally created to find OTUs, but maybe it can help you too.

If you have a specific goal, you may need to create a pipeline yourself using existing clustering tools.

"should get two clusters using kmeans or any other algorithm"

Because of this, I assume you did amplicon sequencing. If you cluster reads coming from the whole genome, you will get many, many clusters.

written 10 months ago by gb (1.9k)

Making a subsample is also called rarefying. There are many discussions about this, and you can probably find plenty of tools to do it.
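
If you do want a random subsample of reads, one common way to draw it in a single pass without loading the whole file is reservoir sampling. The sketch below is only an illustration of that technique, not what any particular tool does: it assumes well-formed 4-line FASTQ records, and `read_fastq`, `subsample_reads` and the toy in-memory data are hypothetical names made up for this example. A uniform random subset keeps the read composition roughly proportional, but it is not guaranteed to "represent" a sample for every downstream analysis.

```python
import random


def read_fastq(lines):
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ text."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)  # skip the '+' separator line
        qual = next(it)
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def subsample_reads(records, n, seed=0):
    """Reservoir sampling: keep n records uniformly at random in one pass."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    reservoir = []
    for i, rec in enumerate(records):
        if i < n:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = rec
    return reservoir


# Toy stand-in for a FASTQ file with 1000 reads.
fastq_lines = "".join(
    f"@read{i}\nACGT\n+\nIIII\n" for i in range(1000)
).splitlines()

subset = subsample_reads(read_fastq(fastq_lines), n=50, seed=42)
print(len(subset))
```

For a real file you would pass an open file handle instead of `fastq_lines`; memory use stays proportional to `n`, not to the number of reads.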

To say which clustering method would be best, we need more information about the goal. But, for example, for OTU clustering with USEARCH or VSEARCH you can add a "sample identifier" to the reads. Then you cluster, and afterwards you can separate the clusters per sample again with the help of an OTU table.
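
As a rough illustration of the "sample identifier" idea (independent of any specific tool), the sketch below tags each read ID with its sample, and then turns a read-to-cluster assignment back into a per-sample count table. The `;sample=` label format, the function names, and the `assignments` mapping are assumptions made up for the example; check your clustering tool's documentation for the label format it actually expects and the output it produces.

```python
from collections import Counter, defaultdict


def tag_read(read_id, sample):
    """Append a sample label to a read ID (the ';sample=' format is an assumption)."""
    return f"{read_id};sample={sample}"


def sample_of(tagged_id):
    """Recover the sample label from a tagged read ID."""
    return tagged_id.rsplit(";sample=", 1)[1]


def otu_table(read_to_cluster):
    """Count reads per (cluster, sample) from a tagged-read -> cluster mapping."""
    table = defaultdict(Counter)
    for tagged_id, cluster in read_to_cluster.items():
        table[cluster][sample_of(tagged_id)] += 1
    return {cluster: dict(counts) for cluster, counts in table.items()}


# Hypothetical clustering result: tagged read ID -> cluster (OTU) name.
assignments = {
    tag_read("r1", "A"): "OTU_1",
    tag_read("r2", "A"): "OTU_1",
    tag_read("r3", "B"): "OTU_1",
    tag_read("r4", "B"): "OTU_2",
}

table = otu_table(assignments)
for otu, counts in sorted(table.items()):
    print(otu, counts)
```

Each row of the resulting table is a cluster and each column a sample, so all samples can be pooled for clustering and still be told apart afterwards.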

written 11 months ago by gb (1.9k)

Also, PCA is a dimension reduction method, not a clustering method. There are methods specific to clustering (partitioning clustering, hierarchical clustering, etc.), along with supporting techniques that help you choose the right method and the right number of clusters (Hopkins statistic, elbow method, etc.). As @Benn already indicated, focusing the analysis on a subset of genes that carry biological meaning related to your study would be the way to go. HTH
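
To show what the elbow method looks like in practice, here is a small sketch under stated assumptions. The `points` stand in for per-sample coordinates (for example, the first two components after a dimension reduction); they are invented toy values. The k-means here is a bare-bones Lloyd's algorithm with a deterministic farthest-first start, written only for illustration; `sq_dist` and `kmeans_wcss` are hypothetical helper names.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wcss(points, k, n_iter=20):
    """Run Lloyd's k-means (farthest-first start) and return the
    within-cluster sum of squares (WCSS)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[idx].append(p)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]

wcss = {k: kmeans_wcss(points, k) for k in range(1, 5)}
for k, value in wcss.items():
    print(k, round(value, 3))
```

With this toy data the WCSS drops sharply from k=1 to k=2 and changes little afterwards; that bend is the "elbow" that suggests two clusters.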

written 11 months ago by H.Hasani (920)