Question: How to cluster the upregulated and downregulated genes in heatmap?
bioinforesearchquestions (280) wrote:

How to cluster the upregulated and downregulated genes in heatmap?

Initial heatmap: [image]
Expected heatmap: [image]

Tags: heatmap, rna-seq
modified 2.8 years ago by Kevin Blighe (63k) • written 2.9 years ago by bioinforesearchquestions (280)
Kevin Blighe (63k) wrote:

You can try messing around with different combinations of the distance, linkage, and re-order functions. With the `heatmap.2` function (assuming that you're using `heatmap.2`), you can specify the following as parameters:

``````
#Re-order rows/columns by mean, use 1-Pearson's correlation distance, and complete linkage
heatmap.2(...,
  reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean),
  distfun=function(x) as.dist(1-cor(t(x))),
  hclustfun=function(x) hclust(x, method="complete"))

#Re-order rows/columns by mean, use Euclidean distance, and Ward's linkage
heatmap.2(...,
  reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean),
  distfun=function(x) dist(x, method="euclidean"),
  hclustfun=function(x) hclust(x, method="ward.D2"))
``````

Various other combinations exist, such as Manhattan and Canberra distance, coupled with single or average linkage.
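As a rough illustration of why the correlation distance tends to split up- and downregulated genes, here is a minimal sketch on a made-up matrix. The gene names, the values and the helper `corDist` are all hypothetical. Correlation distance compares the shape of each gene's profile across samples rather than its absolute level, so genes that move in opposite directions end up far apart.

```r
# Toy data (all values invented): 3 "up" and 3 "down" genes,
# measured in 3 control and 3 treated samples
set.seed(1)
pattern <- c(-1, -1, -1, 1, 1, 1)   # low in controls, high in treated
mat <- rbind(
  up1   = 5 + 2 * pattern,
  up2   = 8 + 1 * pattern,
  up3   = 3 + 3 * pattern,
  down1 = 5 - 2 * pattern,
  down2 = 8 - 1 * pattern,
  down3 = 3 - 3 * pattern
)
colnames(mat) <- c(paste0("ctrl", 1:3), paste0("trt", 1:3))
mat <- mat + rnorm(length(mat), sd = 0.1)   # add a little random noise

# 1 - Pearson correlation between rows (genes), as in the distfun above
corDist <- function(x) as.dist(1 - cor(t(x)))

# Complete linkage, then cut the tree into two groups
hc <- hclust(corDist(mat), method = "complete")
clusters <- cutree(hc, k = 2)
print(split(names(clusters), clusters))

# Any other distance/linkage pair can be swapped in the same way,
# e.g. Manhattan distance with average linkage
hcAlt <- hclust(dist(mat, method = "manhattan"), method = "average")
```

On real data the separation will rarely be this clean, so it is worth comparing a few combinations by eye.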

Also experiment with setting your own breaks for heatmap shading, and with scaling the data yourself to Z-scores (or other values):

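``````
myBreaks <- seq(-3, 3, length.out=101)
#Scale each row (gene) to Z-scores; scale() works on columns, hence the two transposes
heat <- t(scale(t(MyDataMatrix)))
#Pass the pre-scaled matrix and switch off heatmap.2's own scaling
heatmap.2(heat, ..., breaks=myBreaks, scale="none")
``````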
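To check what the scaling step does, here is a small sketch on random data. The matrix contents and the blue-white-red palette are assumptions for illustration only. It confirms that each row ends up with mean 0 and standard deviation 1, and it builds a color vector with one fewer entry than there are breaks, which is what `heatmap.2` expects when `breaks` is supplied.

```r
# Random stand-in for an expression matrix (8 genes x 5 samples)
set.seed(42)
MyDataMatrix <- matrix(rnorm(40, mean = 10, sd = 2), nrow = 8,
                       dimnames = list(paste0("gene", 1:8), paste0("s", 1:5)))

# Row-wise Z-scores: transpose, scale the columns, transpose back
heat <- t(scale(t(MyDataMatrix)))

# 101 break points define 100 color bins between -3 and 3
myBreaks <- seq(-3, 3, length.out = 101)
myCols <- colorRampPalette(c("blue", "white", "red"))(length(myBreaks) - 1)

# Each gene now has mean ~0 and standard deviation 1
print(round(rowMeans(heat), 10))
print(apply(heat, 1, sd))
```

The colors would then be passed as `col=myCols` alongside `breaks=myBreaks`.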

If none of that works, as a last resort you can order the rows yourself in whatever way you want and then 'fix' them in place by switching off row clustering, although you then lose the row dendrogram. Take a look at the parameters `Rowv` and `dendrogram` to see how to do this. See here: https://www.rdocumentation.org/packages/gplots/versions/3.0.1/topics/heatmap.2
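A minimal sketch of that last resort, assuming a two-group design: rows are sorted by the difference in group means (the matrix, the `group` vector and the sorting criterion are all made up here; sorting by your own log2 fold changes would work the same way), and `Rowv=FALSE` with `dendrogram="column"` then keeps the rows in that order.

```r
# Invented matrix: 6 genes, 3 control and 3 treated samples
set.seed(7)
mat <- matrix(rnorm(36), nrow = 6,
              dimnames = list(paste0("gene", 1:6),
                              c(paste0("ctrl", 1:3), paste0("trt", 1:3))))
group <- c("ctrl", "ctrl", "ctrl", "trt", "trt", "trt")

# Difference in mean expression (treated minus control) per gene
delta <- rowMeans(mat[, group == "trt"]) - rowMeans(mat[, group == "ctrl"])

# Sort rows so that the most upregulated genes come first
ordered <- mat[order(delta, decreasing = TRUE), ]
print(rownames(ordered))

# Keep this row order: no row clustering, column dendrogram only
if (requireNamespace("gplots", quietly = TRUE)) {
  gplots::heatmap.2(ordered, Rowv = FALSE, dendrogram = "column",
                    scale = "none", trace = "none")
}
```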

Thanks, Kevin. Sure, I will try them.

Great - let me know how it goes!

Hi Kevin,

After incorporating the "1-Pearson's correlation distance": how do people generally show more than 100 significant genes in a heatmap? I have 620 significant genes (q-value <= 0.05).

Looks great!

Yes, labeling is a major issue, but there are different ways of tackling it:

# Modify `cexRow` and change the dimensions of the heatmap

`cexRow` controls the size of the row labels, as you probably know, whilst elongating the heatmap gives each label more vertical space. For example, try the following:

``````
pdf("MyHeatmap.pdf", width=5, height=11)
par(mar=c(2,2,2,2), cex=1.0)
heatmap.2(..., cexRow=0.6)
dev.off()
``````

# Only include certain genes in the labels

Here you can use a vector as the rownames and only include certain key genes in it. For example, the vector could be:

``````
myKeyGenes <- c("", "", "TP53", "", "", "", "BRCA1", ..., "geneX")
``````

In `heatmap.2`, you then specify this with `labRow=myKeyGenes`. The order of the vector has to match the order of your data-matrix that is used for clustering. You can then use a normal-sized value for `cexRow`, as most of the labels are blank spaces.
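Rather than typing the blanks by hand, the label vector can be derived from the row names, which also guarantees that it matches the order of the data matrix. A minimal sketch, where the gene names and the `keyGenes` list are hypothetical:

```r
# Invented matrix whose row names are gene symbols
genes <- c("GAPDH", "ACTB", "TP53", "MYC", "EGFR", "BRCA1", "KRAS")
mat <- matrix(seq_len(length(genes) * 4), nrow = length(genes),
              dimnames = list(genes, paste0("s", 1:4)))

# Genes that should keep their label; everything else becomes blank
keyGenes <- c("TP53", "BRCA1")
myKeyGenes <- ifelse(rownames(mat) %in% keyGenes, rownames(mat), "")
print(myKeyGenes)
```

`myKeyGenes` is built in the row order of `mat`, so it can go straight into `labRow=myKeyGenes`.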

# Use a color-vector and switch off labelling

Here, you provide a color vector instead of labels and set it with `RowSideColors` in `heatmap.2`. For example, you could shade genes of a certain pathway in one color, or transcripts that are non-coding RNAs.
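A short sketch of the color-vector approach, assuming a hypothetical list of pathway genes and arbitrary colors. `RowSideColors` needs one color per row, in the same order as the rows of the matrix, and the labels are blanked out with a vector of empty strings.

```r
# Invented matrix whose row names are gene symbols
set.seed(3)
genes <- c("GAPDH", "ACTB", "TP53", "MYC", "EGFR", "BRCA1", "KRAS")
mat <- matrix(rnorm(length(genes) * 4), nrow = length(genes),
              dimnames = list(genes, paste0("s", 1:4)))

# Hypothetical pathway membership: one color for members, another for the rest
pathwayGenes <- c("TP53", "BRCA1")
rowCols <- ifelse(rownames(mat) %in% pathwayGenes, "firebrick", "grey80")
print(rowCols)

# Side color bar instead of row labels
if (requireNamespace("gplots", quietly = TRUE)) {
  gplots::heatmap.2(mat, RowSideColors = rowCols,
                    labRow = rep("", nrow(mat)), trace = "none")
}
```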

...of course, you can also use a combination of all of these.

Hi Kevin,

I have an Excel file generated from the Cuffdiff output for genes, with the following columns:

test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant

As per the Excel file, sample_1 is Mutant and sample_2 is Wildtype. Log2(fold_change) is calculated as log2(sample_2/sample_1).

I thought it should be log2(final/initial). Is that not the case?

What is the difference between log2(Mutant/Wildtype) and log2(Wildtype/Mutant)?

Hi friend, the difference is just in the interpretation.

If, for GeneX, Sample1's expression is 20 and Sample2's expression is 5, then:

``````
log2(Sample1/Sample2) = 2
``````

We can make the following statement: Sample1 has higher expression than Sample2 for GeneX.

``````
log2(Sample2/Sample1) = -2
``````

We can make the following statement: Sample2 has lower expression than Sample1 for GeneX.

Both statements imply the same thing. You can see, however, that the choice of numerator and denominator is important.
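The arithmetic is easy to check in R. This sketch just uses the two expression values from the example above (the variable names are made up): swapping numerator and denominator flips the sign of the log2 fold change but leaves its magnitude unchanged.

```r
# Expression of GeneX in the two samples (values from the example above)
sample1 <- 20
sample2 <- 5

lfc12 <- log2(sample1 / sample2)   # Sample1 relative to Sample2
lfc21 <- log2(sample2 / sample1)   # Sample2 relative to Sample1
print(c(lfc12, lfc21))

# Same magnitude, opposite sign
print(isTRUE(all.equal(lfc12, -lfc21)))

# Back-transforming gives the plain fold change
print(2^lfc12)
```

So for a Cuffdiff table where the value is log2(sample_2/sample_1), a positive number means higher expression in sample_2.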

Hi Kevin,

I have a similar problem, but I am not able to reorder my data because I have missing values in some columns. Could you please take a look at my thread?

Thanks!