Question: Hierarchical Clustering Using Python
0
6.9 years ago by
12401920
12401920 wrote:

Hi there, new to clustering,don't quiet get the idea in terms of programming. I have a set of files full on non-coding DNA sequences alignments, I found the distance measure for each alignment, they'll be an array. As I've understood now I have to produce the distance matrix (dimensional array) and then perform hierarchical clustering, which would then generate a tree-structured view. The question is how to produce the distance matrix and what are the further steps for successful clustering? Does the matrix have to be similar?

python clustering distance • 5.1k views
modified 6.9 years ago by Ashutosh Pandey12k • written 6.9 years ago by 12401920
2
6.9 years ago by
Ashutosh Pandey12k wrote:

Well what have you described above is the basis of most of the multiple sequence alignment alogrithms such as CLUSTALW. You may use any of these tools to accomplish what you want.

Assuming you have N sequences. You will have to create N x N matrix where each element (cell) will contain the distance between the corresponding sequences. The value of this distance can be calculated by aligning sequences against each other and calculating alignment score or using some other score. Also, it will be a symmetric matrix i.e. distance between seqA and seqB will be same as distance between seqB and seqA. so you only need to compute half of the matrix.

Once you are done with the matrix creation, you can proceed to Hierarchical clustering.

You will have to start with sequences that have the smallest distance between them. You will merge them and will have to come up with a way to create a consensus sequence that represent the two sequences. Then you will have to create the distance matrix again and merge the two sequences with the smallest distance. This will go on until you are finished with the sequences.

I think in your case, using Python to come up with a consensus sequences is a crucial and complicated step.