Clustering Of Protein Sequences
2
4
Entering edit mode
12.6 years ago
Mkl ▴ 50

Hi,

How to select representative sequences from a cluster? I did hierarchial clustering and I got 30 clusters. Next step is to select representative sequences from each cluster.Is there any criteria to select representative sequences from clusters?

I tried the following code to get medoid from each cluster.

mydist = as.dist(mat)


clusters= cutree(hclust(mydist),h=21)
mydist = as.matrix(mydist)
clust.medoid = function(i, distmat, clusters) {
ind = (clusters == i)
## edit by md, check it out: ##
if (sum(ind) <= 1)
   return (rownames(mydist)[ind]) ## medoid of a single object is the object
else
   return(names(which.min(rowSums( distmat[ind, ind] ))))
## end edit ##
}


sapply(unique(clusters), clust.medoid, mydist, clusters)

I got the following error when I use sapply function

Error in rowSums(distmat[ind, ind]) : 


'x' must be an array of at least two dimensions

How to solve this error?

clustering • 7.0k views
ADD COMMENT
1
Entering edit mode

The best approach really depends on what the sequence clusters represent and what you are trying to accomplish. If you are just clustering to reduce sequence space and select representative sequences (like say Uniprot clusters) that is very different than if you are clustering on protein families...

ADD REPLY
0
Entering edit mode

btw, why don't you use a standard MSA approach?

ADD REPLY
0
Entering edit mode

Regarding your error: I think you have a cluster with only one member, then this function fails. I will edit your code, so it should work, have a look at the edits.

ADD REPLY
0
Entering edit mode

@Michael. Your code worked well. Thank you very much.

ADD REPLY
9
Entering edit mode
12.6 years ago
Michael 54k

"Representative sequences from a cluster", this notion seems to be adressing the medoid of a cluster. A medoid is similar to a centroid but in contrast to a centroid it is always a member of the cluster, while a centroid normally is not member of the data-set. A cluster medoid can be computed from the distance matrix alone, it is the cluster member with minimum average pairwise distance to all other members. A good definition is here: http://www.unesco.org/webworld/idams/advguide/Chapt7_1_1.htm Don't be confused, you do not need to run a k-medoids clustering. And there is ofc only one medoid per cluster (ignoring ties in distances)

Edit to address your comment: There is no package needed for this. The steps in R are the following: For each cluster:

  • get the list of objects clustered in cluster C_i
  • get distance matrix D and restrict to a distance matrix of only containing objects in C_i
  • convert D into a full matrix (in R distance matrix is a lower triangular)
  • find the row/column with minimal rowsum/colsum, that is the medoid of C_i that is the object with minimum overall distance to all other objects in the cluster.

In the following example I am using the USArrests sample dataset, so your sequences are named like states of the USA, don't worry.

A simple one-line example:

> which.min (rowSums(as.matrix(dist(USArrests))))
Virginia 
  46

shows that "Virginia" [at index 46 in the data ] is the medoid of the whole data set. A bit more complicated example, for all the clusters:

mydist = dist(USArrests)
clusters = cutree(hclust(mydist), k=5) # get 5 clusters
mydist = as.matrix(mydist) # get a full matrix

# function to find medoid in cluster i
clust.medoid = function(i, distmat, clusters) {
    ind = (clusters == i)

    names(which.min(rowSums( distmat[ind, ind] )))
    # c(min(rowMeans( distmat[ind, ind] )))
}


sapply(unique(clusters), clust.medoid, mydist, clusters)
[1] "Michigan"      "Missouri"      "Kansas"        "Florida"       "New Hampshire"

Showing that Michigan, ..., are the medoids in the 5 clusters.

Bear in mind that this is probably the most accurate way of getting a representative sequence which is contained in the data; however, the consensus sequence of a multiple sequence alignment might be a more accurate representative of all sequences, though it is not a member of the data itself.

ADD COMMENT
1
Entering edit mode

You cannot i believe, in this case! To compute a centroid you need numeric input data, e.g in euclidean space you can simply take the arithmetic mean. But here, you input data are sequences from which you computed a distance matrix, there is no mean defined for this. The thing that comes closest to a centroid will be a consensus sequence, this is typically computed after a multiple sequence alignment. Check out ClustalW or other tools.

ADD REPLY
0
Entering edit mode

Thank you very much for your answer. I did hierarchial clustering by using R.Do you know how to find medoid using R? Is there any package in R to find medoid?

ADD REPLY
0
Entering edit mode

Michael's last point about a consensus sequence is an extremely key point - a consensus sequence from an MSA may not be a member of the dataset. This needs to be stated more often. Thanks for stating this.

ADD REPLY
0
Entering edit mode

@ Michael .Thank you very much for your useful answer. could you give some information about centroid? How to find centroid by using R?

ADD REPLY
0
Entering edit mode

My data is numeric because first I did pairwise sequence alignment and calculated distance by using this formula "Distance=100-sequence identity" . This data is in excel file and I imported this excel file to R for doing Hierarchial clustering.

ADD REPLY
0
Entering edit mode

If I want to retrieve medoids from 30 clusters, can I use the code like this "clusters = cutree(hclust(mydist), k=30)" . I tried this code as k=30. But I am getting the following error. "Error in rowSums(distmat[ind, ind]) : 'x' must be an array of at least two dimensions". How can I resolve this error?

ADD REPLY
0
Entering edit mode

I too received the error:

Error in rowSums(distmat[ind, ind]) : 'x' must be an array of at least two dimensions"

This happens whenever I try 18 or more clusters in a dataset of 320 observations with 9 variables, all complete cases. Any tips that could help me resolve this error?

ADD REPLY
0
Entering edit mode

Ah, I saw the earlier comment about clusters with less than three members. I identified the observations, and removed one, ethically dubious, but it solved my problem.

ADD REPLY
5
Entering edit mode
12.6 years ago

Defining a representative sequence is being done in ad-hoc manner, but it is very important for the downstream analysis. Often your best representative will be used as a reference sequence for homology searches or sequence alignments.

I addressed this problem during my PhD and developed an objective approach / database to find a best representative profile / sequence / hmm model of a protein families using coverage analysis. We have also expanded this approach using three different programs and discussed the results in another manuscript. You may refer to the manuscripts to see if such an approach using coverage analysis is useful in your case.

ADD COMMENT
1
Entering edit mode

I'll add that in many cases this is the sort of approach you may want to take, it all depends on what you are using your clusters for. If the clusters are, for instance, protein families and you want to do searches of genome data, generating PSSM or HMM files from the clusters to search with is a better strategy as it covers the diversity within those clusters.

ADD REPLY

Login before adding your answer.

Traffic: 2269 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6