Question: Hierarchical Clustering
Sanju wrote, 7.7 years ago:

How can I do hierarchical clustering of protein sequences in FASTA format using R? Could you provide an R script for this?

R clustering programming • 4.7k views
modified 7.6 years ago by Darren J. Fitzpatrick • written 7.7 years ago by Sanju
Aleksandr Levchuk (United States) wrote, 7.7 years ago:

It's funny that you're asking this; just a few days ago my boss sent us the following:

Quote:


Not a bad speed improvement for hierarchical clustering using flashClust. I don't know how it impacts memory usage...

#####################
## Simple Test Run ##
#####################
y <- matrix(rnorm(25000), 5000, 5, dimnames=list(paste("g", 1:5000, sep=""), paste("t", 1:5, sep="")))

## Clustering with standard hclust (before loading flashClust library)
system.time(hclust(dist(y[1:5000,], method = "euclidean"), method="complete"))
##    user  system elapsed
## 276.465   0.716 277.169

## Clustering with flashClust (loading the library masks the standard hclust)
library(flashClust)
system.time(hclust(dist(y[1:5000,], method = "euclidean"), method="complete"))
##   user  system elapsed
##  4.352   0.784   5.137

...end of quote.
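
The quoted test depends on flashClust masking the standard hclust once the library is loaded, so it is easy to lose track of which implementation ran. Here is a minimal sketch that makes the choice explicit with namespace-qualified calls. It uses a much smaller matrix than the quote (200 rows, my own choice, so it finishes instantly) and assumes flashClust may or may not be installed; if it is missing, only the base timing is printed.

```r
set.seed(1)
# Small random matrix (200 x 5); the size is an illustrative assumption
y <- matrix(rnorm(1000), 200, 5,
            dimnames = list(paste0("g", 1:200), paste0("t", 1:5)))
d <- dist(y, method = "euclidean")

# Base R implementation, called explicitly by namespace
t_base <- system.time(hc_base <- stats::hclust(d, method = "complete"))

# flashClust implementation, used only if the package is available
if (requireNamespace("flashClust", quietly = TRUE)) {
  t_flash <- system.time(hc_flash <- flashClust::hclust(d, method = "complete"))
  print(rbind(base = t_base[1:3], flashClust = t_flash[1:3]))
  # Compare the merge heights of the two trees
  print(all.equal(sort(hc_base$height), sort(hc_flash$height)))
} else {
  print(t_base[1:3])
}
```

On a matrix this small the timings will not show the difference reported in the quote; increase the number of rows to see it.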

@krishna, I wrote up the whole pipeline for you. It (1) uses BLAST to align the sequences all-against-all; (2) takes the resulting E-values to construct the all-against-all distance matrix; (3) uses the flashClust R library to generate the hierarchical clustering; and (4) saves the clustering R object to a file and generates a plot of the dendrogram as a PNG.

To run the whole thing, do this:

mkdir hclust-fasta
cd hclust-fasta

wget https://raw.github.com/alevchuk/hclust-fasta/master/my.fasta
wget https://raw.github.com/alevchuk/hclust-fasta/master/001-blast-aaa
wget https://raw.github.com/alevchuk/hclust-fasta/master/002-load-blast-m8
wget https://raw.github.com/alevchuk/hclust-fasta/master/003-hclust

chmod +x 0*

./001-blast-aaa my.fasta
./002-load-blast-m8 my.fasta
./003-hclust my.fasta

The resulting dendrogram will look like this: (image: dend1)
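
For readers who want to see the idea behind steps (2) to (4) without downloading the scripts, here is a rough sketch in plain R. It is not the code from the repository: the hits are made-up rows in the 12-column BLAST tabular layout, and the transform from E-value to distance (a capped -log10, scaled to 0..1) is only one possible choice that I am assuming for illustration. It also uses the standard hclust so that it runs without extra packages.

```r
# Hypothetical all-against-all hits in BLAST tabular format (12 columns)
m8_text <- "
seqA seqA 100.0 120 0 0 1 120 1 120 1e-80 250
seqA seqB 85.0 118 17 1 1 118 2 119 1e-50 180
seqB seqA 85.0 118 17 1 2 119 1 118 2e-50 179
seqB seqB 100.0 119 0 0 1 119 1 119 1e-79 248
seqC seqC 100.0 110 0 0 1 110 1 110 1e-70 230
seqC seqD 60.0 100 40 2 5 104 3 102 1e-10 60
seqD seqC 60.0 100 40 2 3 102 5 104 3e-10 59
seqD seqD 100.0 105 0 0 1 105 1 105 1e-68 225
"
cols <- c("query", "subject", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")
hits <- read.table(text = m8_text, col.names = cols, stringsAsFactors = FALSE)
ids <- sort(unique(c(hits$query, hits$subject)))

# Similarity = -log10(E-value), clipped to [0, cap]; cap is an assumed value
cap <- 100
sim <- pmax(pmin(-log10(pmax(hits$evalue, 1e-300)), cap), 0)

# Fill the similarity matrix; pairs without any hit keep similarity 0
S <- matrix(0, length(ids), length(ids), dimnames = list(ids, ids))
for (i in seq_len(nrow(hits))) {
  q <- hits$query[i]
  s <- hits$subject[i]
  S[q, s] <- max(S[q, s], sim[i])
}
S <- pmax(S, t(S))  # symmetrise: keep the stronger of the two directions
D <- 1 - S / cap    # distance in [0, 1]; 1 means no detectable similarity
diag(D) <- 0

# Cluster, save the R object, and plot the dendrogram as a PNG
hc <- hclust(as.dist(D), method = "average")
saveRDS(hc, file.path(tempdir(), "hclust.rds"))
if (capabilities("png")) {
  png(file.path(tempdir(), "dendrogram.png"), width = 800, height = 600)
  plot(hc, main = "Clustering by BLAST E-value", xlab = "", sub = "")
  dev.off()
}
```

Pairs that BLAST does not report at all end up at the maximum distance of 1, which is worth keeping in mind when reading the top of the tree.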

modified 7.7 years ago • written 7.7 years ago by Aleksandr Levchuk

Wow, what a great answer!

written 7.7 years ago by Istvan Albert

Great answer! But I'm going to be picky: why use 'blastall' when it's now deprecated and everything has been merged into blast+? Despite this, it is still a great answer! :-)

written 7.7 years ago by Leonor Palmeira

@Leonor Palmeira, good point, I will try out BLAST+.

written 7.7 years ago by Aleksandr Levchuk
Qdjm (Toronto) wrote, 7.7 years ago:

I don't use R, but from some quick poking around the "Related" section of your question, I came up with a couple of previous answers that might be helpful.

The first reveals a hierarchical clustering function in R (i.e., hclust):

http://biostar.stackexchange.com/questions/7975/using-r-to-perform-statistical-tests-on-microarray-data-and-cluster-the-results

The second has some details that you might want to consider when writing this code:

http://biostar.stackexchange.com/questions/2536/clustering-of-protein-sequences

Is there any particular reason that this has to be coded in R? There are already a number of software packages for doing this, so why reinvent the wheel?

written 7.7 years ago by Qdjm

Thank you very much

written 7.7 years ago by Sanju
Lyco (Germany) wrote, 7.7 years ago:

I understand that the question is about clustering FASTA sequences, not microarray data. If your aim is to get 'sequence clusters' (i.e., groups of related sequences in a collection of mostly unrelated sequences), you should have a look at the answers to this question, which is very similar to yours. If it is the hierarchical structure you are after, this amounts to constructing a phylogenetic tree (assuming that all the sequences are related). There are lots of discussions here dealing with this issue, and many people recommend RAxML. I have never used it myself, though.

written 7.7 years ago by Lyco

Thank you very much

written 7.7 years ago by Sanju
Darren J. Fitzpatrick (Ireland / United Kingdom) wrote, 7.7 years ago:

In order to do hierarchical clustering, you first require some measure of distance/similarity between your sequences.

By this I mean the following:

Given a set of sequences N, compute the pairwise distance between each sequence s in N and every other sequence in N. This will allow you to create a distance matrix that will subsequently be clustered.

R gives a range of possibilities for generating distance matrices, e.g., Euclidean, Manhattan, etc. These may not be suitable for measuring the distance between sequences. Perhaps the Hamming distance or some measure of pairwise conservation (depending on what you wish to explore using the clusters) will be more appropriate.

My guess is that you will have to generate your own custom distance matrix using an appropriate measure for your data and then do hierarchical clustering.
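
As one possible illustration of such a custom matrix, here is a minimal sketch that computes a normalised Hamming distance (the fraction of mismatched positions) between sequences. It assumes the sequences are already aligned to equal length; the four short sequences and their names are made up for the example.

```r
# Hypothetical pre-aligned protein sequences of equal length
seqs <- c(s1 = "MKTAYIAKQR",
          s2 = "MKTAYIAKQK",
          s3 = "MKSAYLAKQR",
          s4 = "MRSGFLPRNK")
stopifnot(length(unique(nchar(seqs))) == 1)  # Hamming needs equal lengths

# One row per sequence, one column per alignment position
chars <- do.call(rbind, strsplit(seqs, ""))

# Pairwise fraction of positions at which two sequences differ
n <- length(seqs)
X <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    X[i, j] <- mean(chars[i, ] != chars[j, ])
  }
}
print(X)
```

The result is a symmetric matrix with a zero diagonal, which is the shape as.dist() expects. For unaligned sequences you would need an alignment step or an alignment-based score first.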

Example code

Given a custom distance matrix X:

hc <- hclust(as.dist(X), method='average')

Note that you will have to choose among the linkage methods for hierarchical clustering, e.g., single, average and complete.
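
To get a feel for how the choice of method matters, here is a small sketch that builds the tree with each of the three linkages on the same matrix and compares them. The distance matrix is a made-up example, and using the cophenetic correlation as a rough quality check is my own suggestion rather than a rule.

```r
# Hypothetical distance matrix for four sequences (symmetric, zero diagonal)
ids <- c("s1", "s2", "s3", "s4")
X <- matrix(c(0.0, 0.1, 0.2, 0.9,
              0.1, 0.0, 0.3, 0.8,
              0.2, 0.3, 0.0, 0.7,
              0.9, 0.8, 0.7, 0.0),
            nrow = 4, byrow = TRUE, dimnames = list(ids, ids))
d <- as.dist(X)

# Build one tree per linkage method
methods <- c("single", "average", "complete")
trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# Merge heights differ between linkages even when the topology is the same
heights <- sapply(trees, function(hc) hc$height)
print(heights)

# Cophenetic correlation: how well each tree preserves the input distances
coph <- sapply(trees, function(hc) {
  cor(as.vector(d), as.vector(cophenetic(hc)))
})
print(round(coph, 3))

# Cut the average-linkage tree into two flat clusters
groups <- cutree(trees[["average"]], k = 2)
print(groups)
```

With real data the three methods can also produce different groupings, so it is worth checking whether your conclusions survive a change of linkage.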

Good tutorials on clustering in R are given below:

http://cc.oulu.fi/~jarioksa/opetus/metodi/sessio3.pdf

http://www.statmethods.net/advstats/cluster.html

Good Luck!

written 7.7 years ago by Darren J. Fitzpatrick

Thank you very much

written 7.7 years ago by Sanju