The heatmap command
library(gplots)
heatmap.2(lcpm, col=redgreen(10))
gave following graph. How to interpret color key histogram ? Why there are two bright green bands on top and bottom, what it signifies ?
The heatmap command
library(gplots)
heatmap.2(lcpm, col=redgreen(10))
gave following graph. How to interpret color key histogram ? Why there are two bright green bands on top and bottom, what it signifies ?
This does not look right at all. The count distribution via the key and histogram is biased toward high counts (green), for whatever reason. Scaling this distribution to be more evenly distributed will assist visualisation, and also setting colour break-points accordingly.
Can you please follow my simple approach here, and see what you get: A: Heatmap based with FPKM values
Kevin
Hi Björn,
Yes, you definitely need to scale them, as per the lines of code in the link that I gave (above), i.e., literally, execute this:
require("RColorBrewer")
myCol <- colorRampPalette(c("dodgerblue", "black", "yellow"))(100)
myBreaks <- seq(-3, 3, length.out=101)
#Scale the values to the Z scale
heat <- t(scale(t(lcpm)))
#Generate heatmaps with Euclidean distance (first) and '1 - Pearson correlation' distance (second) (both use Ward's linkage)
require("gplots")
#Euclidean distance
heatmap.2(heat, col=myCol, breaks=myBreaks, main="Title", key=T, keysize=1.0, scale="none", density.info="none", reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean), trace="none", cexRow=0.2, cexCol=0.8, distfun=function(x) dist(x, method="euclidean"), hclustfun=function(x) hclust(x, method="ward.D2"))
#1-cor distance
heatmap.2(heat, col=myCol, breaks=myBreaks, main="Title", key=T, keysize=1.0, scale="none", density.info="none", reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean), trace="none", cexRow=0.2, cexCol=0.8, distfun=function(x) as.dist(1-cor(t(x))), hclustfun=function(x) hclust(x, method="ward.D2"))
Regarding rownames, you can usually set these with the labRow
parameter of heatmap.2()
. For exampl:
heatmap2(..., labRow=rownames(lcpm),...)
This obviously requires that the rownames of your lcpm object are indeed gene names. You can check by just running rownames(lcpm)
The heatmap.2() function does not perform scaling of the data by default (unlike the base heatmap function). Thus your data has bright green bands because those rows have values in the upper range of your distribution and you've set high values to be green, while the black rows have values close to zero. The histogram in the key is telling you the frequency of values overall in the data set. You can see that the distribution has a right shoulder towards green. The trace down each column is telling you the size of the measurements by column position.
Why did you draw this heat map? What did you expect to see? Depending on your needs, you might choose to scale the data (see ?heatmap.2 ), or remove rows with little variation.
There's still something strange about your data, as you describe it:
lcpm<-cpm(y) to find DE genes
This is not the command to find DE genes using edgeR. It is the command to transform feature counts from integers into counts per million.
what does x-scale -20 to 20 means?
the key and scale are telling you the range of values you are plotting. Thus, according to your heat map, you have values ranging from -20 to 20, and you can see from the histogram that you have more positive than negative values.
I'm guessing, that since your heat map scale is -20 to 20, more or less centered at 0, you are actually plotting log ratios, and perhaps don't realize it.
To shorlist DE genes, should I limit only those in bright green bands, then how ?
To shortlist genes, you would not use the heat map, you would use the p-values (hopefully adjusted p-values) from edgeR. However, you can use the heatmap to isolate certain rows if you like. I would suggest you play with some toy examples to get a sense of what's going on.
# create a toy matrix of count data
x <- matrix(round(runif(100,1,200)), nrow=20, ncol=10)
# name the rows like they are genes
rownames(x) <- paste0("g", 1:nrow(x))
# you could plot this just to see what it looks like
library(gplots)
heatmap.2(x, col=redgreen(10), Rowv=FALSE, Colv=FALSE)
# you should see the key telling you the value range
# you should see row labels
# you said you used the cpm function and plotted the results
library(edgeR)
x.cpm <- cpm(x)
heatmap.2(x.cpm, col=redgreen(10))
# most likely you're plotting log ratios
# make some data resembling this
x <- matrix(rnorm(200), nrow=20, ncol=10)
rownames(x) <- paste0("g", 1:nrow(x))
heatmap.2(x, col=redgreen(10))
# once again, notice the color key
# Examine the effect of scaling
# to do this, let's add postive values to some rows
x[1:3,] <- x[1:3,] + 2
heatmap.2(x, col=redgreen(10))
# now we see that those rows are clearly
# different than the rest
# but if we scaled the data plot
heatmap.2(x, scale="row", col=redgreen(10))
# visually the difference goes away, but the
# dendrogram still tells us those rows are different
# Notice also the new values in the legend!!! (Z-scores)
# we can pick out certain rows by saving information
# from drawing the heatmap:
hm <- heatmap.2(x, col=redgreen(10))
# Examine the hm object:
hm
# see that it contains row indicies. we can examine our
# data table re-ordered like the heatmap
x[hm$rowInd,]
# but it's actually drawn upside down, so reverse the index
x[rev(hm$rowInd),]
# This might give you a clue for selecting certain rows
# so select genes with multiple positive values
iv <- apply(x, 1, function(x) sum(x > 1) >= 4)
rownames(x[iv,])
DESeq2 provides read count transformation in order to visualize them as a heatmap.
Look at : - https://bioconductor.org/packages/3.7/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization for transformation and https://bioconductor.org/packages/3.7/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#heatmap-of-the-count-matrix for heatmap
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
You can also use the
scale="row"
heatmap.2 parameter.Additional question on the heatmap. how to get the real gene names (row names) instead of numbers ?