Question: Clustering in Perl
 
2
 
 

I'm currently using Algorithm::Cluster, which is based on the C Clustering Library, to cluster sequences and structures in Perl. Algorithm::Cluster provides many clustering facilities, including hierarchical clustering. Given the desired number of clusters, it builds the tree and cuts it. What I need, however, is a library that allows for a threshold. Something like: all of the members of one cluster are <= X distance apart, or: any two members of different clusters are >= X distance apart.

Is this possible in Algorithm::Cluster? Or is there another (Perl) module that would, given a distance matrix and a threshold, determine the appropriate number of clusters and their members?

 
 

2 answers

 
4
 
 

I've submitted a patch against Algorithm::Cluster to allow:

my $cluster_ids = $tree->cutthresh(3.75);

The patch adds an XS interface (i.e. the code is in C). You can find it on this bug report:

https://rt.cpan.org/Public/Bug/Display.html?id=68482

Those interested in a quick Pure Perl solution can use this example which uses some undocumented XS interfaces:

sub cutthresh {
my ($tree, $thresh) = @_;   
my @nodecluster;
my @leafcluster;
# Binary tree: number of internal nodes is 1 less than # of leafs
# Last node is the root, walking down the tree
my $icluster = 0;
# Root node belongs to cluster 0
$nodecluster[@doms-2] = $icluster++;
for (my $i = @doms-2; $i >= 0; $i--) {        
    my $node = $tree->get($i);
    say sprintf "%3d %3d %.3f", $i,$nodecluster[$i], $node->distance;
    my $left = $node->left;
    # Nodes are numbered -1,-2,... Leafs are numbered 0,1,2,...
    my $leftref = $left < 0 ? \$nodecluster[-$left-1] : \$leafcluster[$left];
    my $assigncluster = $nodecluster[$i];
    # Left is always the same as the parent node's cluster
    $$leftref = $assigncluster;
    say sprintf "\tleft  %3d %3d", $left, $$leftref;
    my $right = $node->right;
    # Put right into a new cluster, when thresh not satisfied
    if ($node->distance > $thresh) { $assigncluster = $icluster++ }
    my $rightref = $right < 0 ? \$nodecluster[-$right-1] : \$leafcluster[$right];
    $$rightref = $assigncluster;
    say sprintf "\tright %3d %3d", $right, $$rightref;
}
return @leafcluster;
}
 
 
 

The pure Perl version of this has now been implemented as http://p3rl.org/Algorithm::Cluster::Thresh for those who are interested.

log in to reply • written 10 months ago by Chad A. Davis  103
 
 
1
 
 

Do you have to use Perl? I am a huge fan of Perl but for these sort of tasks I would use R. I have used this website to learn clustering in R. You can still use Perl to connect with R if you have to. I played with RSPerl for a while but in the end it was easier for me just to use R scripts.

 
 
 
Log in to add a post