Question

Hierarchical Clustering Using Python

0

Entering edit mode

11.0 years ago

1240192 • 0

Hi there, new to clustering,don't quiet get the idea in terms of programming. I have a set of files full on non-coding DNA sequences alignments, I found the distance measure for each alignment, they'll be an array. As I've understood now I have to produce the distance matrix (dimensional array) and then perform hierarchical clustering, which would then generate a tree-structured view. The question is how to produce the distance matrix and what are the further steps for successful clustering? Does the matrix have to be similar?

Thanks in advance.

python clustering distance • 6.7k views

ADD COMMENT • link updated 11.0 years ago by Ashutosh Pandey 12k • written 11.0 years ago by 1240192 • 0

score 2 · Answer 1 · 2013-04-19

Well what have you described above is the basis of most of the multiple sequence alignment alogrithms such as CLUSTALW. You may use any of these tools to accomplish what you want.

Assuming you have N sequences. You will have to create N x N matrix where each element (cell) will contain the distance between the corresponding sequences. The value of this distance can be calculated by aligning sequences against each other and calculating alignment score or using some other score. Also, it will be a symmetric matrix i.e. distance between seqA and seqB will be same as distance between seqB and seqA. so you only need to compute half of the matrix.

Once you are done with the matrix creation, you can proceed to Hierarchical clustering.

You will have to start with sequences that have the smallest distance between them. You will merge them and will have to come up with a way to create a consensus sequence that represent the two sequences. Then you will have to create the distance matrix again and merge the two sequences with the smallest distance. This will go on until you are finished with the sequences.

I think in your case, using Python to come up with a consensus sequences is a crucial and complicated step.