Question: Clustering Data (Rna-Seq) Using R To Produce A Heatmap
16
Kanne (Australia), 7.0 years ago, wrote:

I have RNA-seq data (FPKMs) from Cufflinks and would like to cluster it by gene and produce a heatmap.

This is my first try at using R and I have spent a LOT of time poring over the manual/help pages and internet tutorials on how to do this.

I can now produce heatmaps using "heatmap" easily enough. My problem is that I can produce them from many different versions/transformations of my data, and I cannot figure out what is going on or which heatmap represents the analysis I am interested in.

What I am trying to get is a) gene names clustered by expression profile, to mine for enriched gene groups/pathways; and b) a heatmap of FPKM values, with the same gene clustering.

This is the R code: Data input/preparation

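 # Read the tab-delimited FPKM table (read.table already returns a data frame)
 m <- read.table("DMSTSC1000_notmeanctrd.txt", header=TRUE, sep="\t")
 row.names(m) <- m$test_id   # use the test_id column as row names
 m <- m[,2:7]                # keep columns 2 to 7 only
 m_matrix <- data.matrix(m)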

Making Heatmap version 1:

heatmap(m_matrix, Colv=NA, scale="column")

Making Heatmap version 2. This came about because a paper described using a Pearson correlation metric with clustering, but this heatmap looks terrible: the clustering appears to bear little relationship to the imaged data.

cor_t <- cor(t(m_matrix))
distancet <- as.dist(cor_t)
hclust_complete <- hclust(distancet, method = "complete")
dendcomplete <- as.dendrogram(hclust_complete)
heatmap(m_matrix, Rowv=dendcomplete, Colv=NA, scale="column")

Making Heatmap version 3

distancem <- dist(m_matrix)
hclust_completem <- hclust(distancem, method = "complete")
dendcompletem <- as.dendrogram(hclust_completem)
heatmap(m_matrix, Rowv=dendcompletem, Colv=NA, scale="column")

Or, if you have code for a fourth way that you're confident about, I'd love to hear it! I tried to use pam but haven't been able to produce a heatmap from it yet.

Sorry about not uploading images, I haven't figured out how to web-host them yet.

Details: FPKM data has been log2 transformed and high outliers were capped at a maximum value (10), to increase the range of colors used for the majority of the data.
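
A minimal sketch of the preprocessing described above, on made-up numbers. The pseudocount of 1 before taking log2 is an assumption (the post does not say how zero FPKMs were handled), and `fpkm` is a hypothetical stand-in for the real table:

```r
# Toy FPKM matrix: 3 genes x 2 samples (values are invented for illustration)
fpkm <- matrix(c(0, 3, 1023, 7, 15, 4095), nrow = 3,
               dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# log2 transform; the +1 pseudocount is an assumption to avoid log2(0) = -Inf
log_fpkm <- log2(fpkm + 1)

# Cap high outliers at 10 so they do not dominate the color scale
capped <- pmin(log_fpkm, 10)
print(capped)
```

Capping compresses only the extreme values, so the bulk of the data keeps its relative ordering while spreading across more of the color range.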

Thank you in advance for your help, it is very much appreciated!!

gene R rna clustering heatmap • 56k views
modified 3.6 years ago by Bohdan Khomtchouk • written 7.0 years ago by Kanne

Maybe you should change the title since, from what I understood, your problem is more about choosing clustering methods than about generating and analyzing heatmaps, which you already seem to know how to do.

written 7.0 years ago by Philippe

True, will do, thanks!

written 7.0 years ago by Kanne
16
Michael Dondrup (Bergen, Norway), 7.0 years ago, wrote:

To shorten your search: there is no correct answer and no best method for choosing distance measures in cluster analysis. If there were, everybody would be using it. In data mining there are a gazillion methods, and each has different characteristics that make different aspects of the data visible. The idea is not to rely on a single best method, but to try several that help you generate new hypotheses about the data.

That said, there is one important requirement for distance measures that your use of correlation as a distance does not satisfy. I'd phrase it like this: similar objects have d close to 0, dissimilar objects have d > 0, and the more dissimilar the objects, the larger d. Correlation, however, ranges over -1 <= r <= 1 and behaves the other way around (the most similar profiles have the largest r), so it has to be transformed first. There are at least a few ways, with different characteristics, to turn correlation into a distance:

  • correlation distance: d := 1 - r (anti-correlation: d=2, no correlation: d=1, full correlation: d=0)
  • absolute correlation distance: d := 1 - |r| (edit: d := |1 - r| was a little mistake, because the result is identical to the first distance)
  • r-squared distance: d := 1 - r^2 (no correlation: d=1, anti- and full correlation: d=0)

This explains why your attempt at using the raw correlation as a distance didn't work out. Therefore try the following R code, and see if it improves things:

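# Use ONE of the following alternatives:
cor_t <- 1 - cor(t(m_matrix))       # correlation distance, or
cor_t <- 1 - abs(cor(t(m_matrix)))  # absolute correlation distance (edited), or
cor_t <- 1 - cor(t(m_matrix))^2     # r-squared distance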

These are still not true distance metrics, because they can violate the triangle inequality, but they are usable for clustering all the same.
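
Putting this together with the code from the question, a hedged end-to-end sketch could look like the following. It uses a small random matrix (`toy`, a hypothetical stand-in for `m_matrix`) and the first variant, d = 1 - r. The choice of k = 2 in `cutree` is arbitrary and only for illustration:

```r
set.seed(1)
# Hypothetical stand-in for m_matrix: 20 genes x 6 samples of random values
toy <- matrix(rnorm(120), nrow = 20,
              dimnames = list(paste0("gene", 1:20), paste0("s", 1:6)))

# Correlation between gene profiles (rows), converted to a dissimilarity
d_cor <- as.dist(1 - cor(t(toy)))

# Hierarchical clustering of genes on that dissimilarity
hc <- hclust(d_cor, method = "complete")

# Heatmap that reuses the same gene dendrogram (settings as in the question)
heatmap(toy, Rowv = as.dendrogram(hc), Colv = NA, scale = "column")

# Gene names per cluster, e.g. as input for enrichment analysis
clusters <- cutree(hc, k = 2)
print(split(names(clusters), clusters))
```

Because the dendrogram passed to `heatmap` and the tree passed to `cutree` come from the same `hclust` object, the gene lists and the heatmap rows share one clustering.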

modified 5.4 years ago • written 7.0 years ago by Michael Dondrup

Wow, good point! Thanks a lot, especially for the code, that is very helpful!!

written 7.0 years ago by Kanne

Very helpful explanation, thanks.

written 6.3 years ago by Abhi
2
Charles Warden (Duarte, CA), 5.0 years ago, wrote:

You might also want to try out heatmap.2 from the gplots package:

http://cran.r-project.org/web/packages/gplots/index.html

http://mannheimiagoesprogramming.blogspot.com/2012/06/drawing-heatmaps-in-r-with-heatmap2.html

I think it has a little more functionality that is useful for gene expression visualization.
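
As a rough sketch of how that could be combined with a correlation-based distance: `heatmap.2` accepts `distfun` and `hclustfun` arguments, so the distance can be supplied as a function. The matrix `toy` and the helper `cor_dist` are hypothetical names introduced here for illustration, and the example assumes the gplots package is installed:

```r
set.seed(1)
# Hypothetical stand-in for the real expression matrix: 20 genes x 6 samples
toy <- matrix(rnorm(120), nrow = 20,
              dimnames = list(paste0("gene", 1:20), paste0("s", 1:6)))

# Hypothetical helper: correlation distance (1 - r) between the rows of x
cor_dist <- function(x) as.dist(1 - cor(t(x)))

# Only draw the plot if gplots is available
if (requireNamespace("gplots", quietly = TRUE)) {
  gplots::heatmap.2(toy,
                    distfun = cor_dist,
                    hclustfun = function(d) hclust(d, method = "complete"),
                    Colv = FALSE, dendrogram = "row",
                    scale = "column", trace = "none")
}
```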

written 5.0 years ago by Charles Warden
1
Bohdan Khomtchouk (Stanford University), 3.6 years ago, wrote:

"I can now produce heatmaps using "heatmap" easily enough, my problem is that I can produce them from many different versions/transformations of my data and I cannot figure out what is going on and which heatmap is the analysis I am interested in."

>> HeatmapGenerator has a database storage system which stores any heatmap you have ever produced along with its corresponding name so that you can always refer back to a heatmap you made in the past from a central repository.  Source: http://sourceforge.net/projects/heatmapgenerator/ 

written 3.6 years ago by Bohdan Khomtchouk