Automatic Clustering Of Biological Sequences
13.0 years ago
Anjan ▴ 830

K-means requires one to specify the number of clusters; clustering based on HCL requires the user to visually inspect a tree. Have you successfully used an automatic clustering algorithm that provides an optimal partitioning of clustered data? Thanks, Anjan

clustering sequence • 3.6k views

Looked here?

13.0 years ago
Jan Kosinski ★ 1.6k

I doubt that you can find any fully automatic clustering method that will always "optimally" partition your sequences. It depends so much on 1) the biological question you want to answer with your clustering, and 2) the data itself.

For clustering protein sequences I always used http://www.eb.tuebingen.mpg.de/departments/1-protein-evolution/software/clans

It produces a 3D graph layout of sequence space, which I used to divide the sequences into clusters manually. In this program it's relatively easy to analyze the structure of sequence space, make all the selections etc., and find out which clusters the outliers belong to.


Minor comment - Tancred moved down under a few years ago, and an updated version of CLANS is available at its new homepage: http://bioinfoserver.rsbs.anu.edu.au/programs/clans/


Thanks for the link. I was considering a multidimensional scaling approach to place a group of aligned sequences in 2D/3D space. In the second step I would run a Differential Evolution (DE) based clustering algorithm to find the optimal partition. DE has been used successfully for automated clustering of diverse data and also for image segmentation.
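
For the first step, here is a minimal sketch of classical (Torgerson) multidimensional scaling with NumPy. It is only an illustration of the idea: the function name `classical_mds` and the toy distance matrix are made up, and in practice the matrix would come from pairwise distances between the aligned sequences. The DE clustering step is not shown.

```python
import numpy as np


def classical_mds(distance, n_components=2):
    """Embed points in n_components dimensions from a pairwise distance matrix."""
    d = np.asarray(distance, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # The Gram matrix is symmetric, so eigh applies; keep the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip small negative eigenvalues that arise from non-Euclidean distances
    # or floating-point noise.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Hypothetical toy input: distances between four items. Real input would be
# a sequence distance matrix (for example, derived from an alignment).
distance = np.array([
    [0.0, 3.0, 4.0, 5.0],
    [3.0, 0.0, 5.0, 4.0],
    [4.0, 5.0, 0.0, 3.0],
    [5.0, 4.0, 3.0, 0.0],
])

coords = classical_mds(distance, n_components=2)
print(coords)
```

The resulting coordinates are only defined up to rotation and reflection, so it is the distances between points that matter, not the axes. Sequence distances are often not exactly Euclidean, in which case the embedding is an approximation.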

13.0 years ago

One method that I haven't seen mentioned here (or in other linked articles) is affinity propagation. There is a video on the page that shows how it works on a small dataset, and it is (as far as I know) the only clustering-type algorithm that was the focus of an entire publication in Science(!).


I have tried affinity propagation in the scikit-learn package, and it looks pretty good. Basically, one feeds in an affinity matrix (calculated from a phylogenetic distance matrix by taking (1 - distance)), or Euclidean coordinates that are derived from an affinity or distance matrix. If your labels are in your data as well, you can easily feed them in and find out who clusters with whom. Sample code (minus the data I'm working with) available upon request!
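
A rough sketch of that workflow, assuming a distance matrix already scaled to [0, 1]. The sequence names and distances below are invented for illustration, and the parameters are scikit-learn defaults apart from `affinity="precomputed"` and a fixed `random_state`; real data will likely need the `preference` and `damping` settings tuned.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical toy data: pairwise distances between six sequences.
# seqA-seqC are close to each other, as are seqD-seqF.
labels = ["seqA", "seqB", "seqC", "seqD", "seqE", "seqF"]
distance = np.array([
    [0.00, 0.10, 0.15, 0.90, 0.85, 0.95],
    [0.10, 0.00, 0.12, 0.88, 0.90, 0.92],
    [0.15, 0.12, 0.00, 0.91, 0.87, 0.89],
    [0.90, 0.88, 0.91, 0.00, 0.11, 0.14],
    [0.85, 0.90, 0.87, 0.11, 0.00, 0.13],
    [0.95, 0.92, 0.89, 0.14, 0.13, 0.00],
])

# Convert distances to affinities as described above: affinity = 1 - distance.
similarity = 1.0 - distance

# affinity="precomputed" tells scikit-learn that the input is already a
# similarity matrix rather than feature vectors.
model = AffinityPropagation(affinity="precomputed", random_state=0)
cluster_ids = model.fit_predict(similarity)

# Group the sequence names by cluster to see who clusters with whom.
clusters = {}
for name, cid in zip(labels, cluster_ids):
    clusters.setdefault(int(cid), []).append(name)

for cid, members in sorted(clusters.items()):
    print(cid, members)
```

The number of clusters is not specified anywhere; it falls out of the `preference` value, which defaults to the median of the input similarities.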

13.0 years ago
hadasa ★ 1.0k

Have a look at MCL graph clustering. http://micans.org/mcl/
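
To give a feel for the idea behind Markov clustering, here is a toy NumPy sketch of the expansion/inflation loop. It is not the `mcl` program from the link above and leaves out its pruning and other optimizations; the function name `mcl_sketch`, the parameter defaults, and the example graph are all assumptions made for illustration.

```python
import numpy as np


def mcl_sketch(adjacency, expansion=2, inflation=2.0, max_iter=100, tol=1e-6):
    """Toy Markov clustering on a symmetric adjacency (similarity) matrix."""
    m = np.asarray(adjacency, dtype=float).copy()
    np.fill_diagonal(m, 1.0)               # add self-loops
    m /= m.sum(axis=0, keepdims=True)      # make columns sum to 1
    for _ in range(max_iter):
        previous = m
        m = np.linalg.matrix_power(m, expansion)  # expansion: random-walk steps
        m = m ** inflation                        # inflation: boost strong flow
        m /= m.sum(axis=0, keepdims=True)
        if np.abs(m - previous).max() < tol:
            break
    # Rows that still carry flow belong to attractors; the nonzero columns
    # of such a row are the members of that cluster.
    clusters = set()
    for row in m:
        members = tuple(np.nonzero(row > 1e-6)[0].tolist())
        if members:
            clusters.add(members)
    return sorted(clusters)


# Hypothetical graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
# For sequences, edge weights would typically come from pairwise similarity scores.
adjacency = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adjacency[i, j] = adjacency[j, i] = 1.0

result = mcl_sketch(adjacency)
print(result)
```

The inflation value controls granularity: higher values tend to give more, smaller clusters, so the number of clusters is again not set directly.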
