6.9 years ago by

Philadelphia

Well what have you described above is the basis of most of the multiple sequence alignment alogrithms such as CLUSTALW. You may use any of these tools to accomplish what you want.

Assuming you have N sequences. You will have to create N x N matrix where each element (cell) will contain the distance between the corresponding sequences. The value of this distance can be calculated by aligning sequences against each other and calculating alignment score or using some other score. Also, it will be a symmetric matrix i.e. distance between seqA and seqB will be same as distance between seqB and seqA. so you only need to compute half of the matrix.

Once you are done with the matrix creation, you can proceed to Hierarchical clustering.

You will have to start with sequences that have the smallest distance between them. You will merge them and will have to come up with a way to create a consensus sequence that represent the two sequences. Then you will have to create the distance matrix again and merge the two sequences with the smallest distance. This will go on until you are finished with the sequences.

I think in your case, using Python to come up with a consensus sequences is a crucial and complicated step.