Question

Automatic Clustering Of Biological Sequences

0

13.0 years ago

Anjan ▴ 830

K-means requires one to specify the number of clusters in advance; hierarchical clustering (HCL) requires the user to visually inspect a tree and decide where to cut it. Have you successfully used an automatic clustering algorithm that provides an optimal partitioning of clustered data? Thanks, Anjan

clustering sequence • 3.6k views

ADD COMMENT • link updated 13.0 years ago by Steve Lianoglou 5.2k • written 13.0 years ago by Anjan ▴ 830

4

Looked here?

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 13.0 years ago by Michael 54k

score 3 · Answer 1 · 2011-04-06

13.0 years ago

Jan Kosinski ★ 1.6k

I doubt that you can find any fully automatic clustering method that will always "optimally" partition your sequences. It depends heavily on 1) the biological question you want to answer with your clustering, and 2) the data itself.

For clustering protein sequences I always used http://www.eb.tuebingen.mpg.de/departments/1-protein-evolution/software/clans

It produces a 3D graph layout of sequence space, based on which I divided the sequences into clusters manually. In this program it's relatively easy to analyze the structure of sequence space, do all the selections etc., and find out which clusters the outliers belong to.

1

Minor comment - Tancred moved down under a few years ago, and an updated version of CLANS is available at its new homepage: http://bioinfoserver.rsbs.anu.edu.au/programs/clans/

ADD REPLY • link 13.0 years ago by Pawel Szczesny 3.2k

0

Thanks for the link. I was considering a multidimensional scaling approach to cluster a group of aligned sequences in 2/3-D space. In the second step I would run a Differential Evolution (DE) based clustering algorithm that would find the optimal partition. DE has been successfully used in automated clustering of diverse data and also image segmentation.

ADD REPLY • link 13.0 years ago by Anjan ▴ 830
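The first step of the approach described above (embedding sequences in 2/3-D space from pairwise distances) can be sketched with classical (Torgerson) multidimensional scaling. This is only an illustration under assumptions: the reply does not say which MDS variant was intended, the function name `classical_mds` is made up for this example, and the pairwise distance matrix is assumed to have been computed from the alignment already. The DE-based clustering step is not shown.

```python
import numpy as np


def classical_mds(dist, n_components=2):
    """Embed items in n_components dimensions from a square distance matrix.

    Hypothetical helper: classical (Torgerson) MDS via eigendecomposition
    of the double-centered squared-distance matrix.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double centering turns squared distances into a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Non-Euclidean distances can give small negative eigenvalues; clip them.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


if __name__ == "__main__":
    # Toy distances standing in for alignment-derived ones (an assumption).
    toy = np.array([[0.0, 0.1, 0.8],
                    [0.1, 0.0, 0.7],
                    [0.8, 0.7, 0.0]])
    print(classical_mds(toy, n_components=2))
```

The resulting coordinates could then be passed to whatever clustering procedure is chosen for the second step. Note that phylogenetic distances are not guaranteed to be Euclidean, so the low-dimensional layout is an approximation.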

score 2 · Answer 2 · 2011-04-06

13.0 years ago

Steve Lianoglou 5.2k

One method that I haven't seen mentioned here (or in the other linked articles) is affinity propagation. There is a video on the page that shows how it works on a small dataset, and it is (as far as I know) the only clustering-type algorithm that was the focus of an entire publication in Science(!).

0

I have tried affinity propagation in the scikit-learn package, and it looks pretty good. Basically, one feeds in either an affinity matrix (calculated from a phylogenetic distance matrix by taking (1 - distance)) or Euclidean coordinates derived from an affinity or distance matrix. If your labels are in your data as well, you can easily feed them in and find out who clusters with who. Sample code (minus the data I'm working with) available upon request!

ADD REPLY • link 11.0 years ago by ericmajinglong ▴ 120
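For readers who want a starting point, the workflow described in the reply above might look roughly like the following. This is not the commenter's code, just a minimal sketch using scikit-learn's `AffinityPropagation` with `affinity="precomputed"`. The helper name `cluster_from_distances`, the toy distances, and the sequence labels are all made up for illustration, and the (1 - distance) conversion assumes distances scaled to the range 0 to 1.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def cluster_from_distances(dist, labels, random_state=0):
    """Group labels by affinity propagation on similarity = 1 - distance.

    Hypothetical helper; assumes `dist` is a square, symmetric matrix
    with values scaled to [0, 1].
    """
    dist = np.asarray(dist, dtype=float)
    similarity = 1.0 - dist  # the conversion mentioned in the reply
    model = AffinityPropagation(affinity="precomputed",
                                random_state=random_state)
    assignments = model.fit_predict(similarity)
    # Collect the labels that share a cluster id.
    clusters = {}
    for label, cluster_id in zip(labels, assignments):
        clusters.setdefault(int(cluster_id), []).append(label)
    return clusters


if __name__ == "__main__":
    # Toy data: six made-up sequences forming two well-separated groups.
    names = ["a1", "a2", "a3", "b1", "b2", "b3"]
    positions = np.array([0.00, 0.05, 0.12, 0.80, 0.86, 0.95])
    toy_dist = np.abs(positions[:, None] - positions[None, :])
    for cluster_id, members in cluster_from_distances(toy_dist, names).items():
        print(cluster_id, members)
```

The number of clusters is not specified anywhere; it is governed by the `preference` parameter, which scikit-learn sets to the median input similarity by default. If the algorithm fails to converge, scikit-learn emits a warning and labels every sample -1, so it is worth checking the output before interpreting it.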

score 1 · Answer 3 · 2011-04-06

13.0 years ago

hadasa ★ 1.0k

Have a look at MCL graph clustering. http://micans.org/mcl/