Question: Preparing RNA-seq data for Hierarchical Clustering
gravatar for LRStar
2.6 years ago by
United States
LRStar190 wrote:

I am clustering RNA-seq data into groups of similar expression patterns and visualizing the results. To this end, I have been:

1) Logging and normalizing the RNA-seq data with the following two methods (first is from edgeR package, second is from EDASeq package) <- cpm(data, TRUE, TRUE)
betweenLaneNormalization(, which="full", round=FALSE)

2) Standardizing each gene to have a mean=0 and standard deviation=1.

3) Performing hierarchical clustering (from the stats package) using ward.D linkage

hclust(d, method="ward.D")

The resulting clusters look pretty clean when plotted. However, I was trying to determine if this is a recommended approach to hierarchical of clustering gene expression (to normalize, log, and standardize in this manner)? It is unclear to me if the literature has a recommended method for preparing RNA-seq for hierarchical clustering?

Thank you for sharing any advice or information you may have.

rna-seq • 3.7k views
ADD COMMENTlink modified 2.6 years ago by Kevin Blighe65k • written 2.6 years ago by LRStar190
gravatar for Kevin Blighe
2.6 years ago by
Kevin Blighe65k
Kevin Blighe65k wrote:

Usually, yes, raw counts are normalised, logged, and then some other transformation can sometimes occur prior to clustering. I'd be interested to see your data distribution after your step 2). You can plot the distribution with hist()

The actual differential expression statistical tests would have been performed on the normalised un-logged counts.

The type of data that is used for hierarchical clustering should match the chosen distance metric when calculating the distance matrix using dist() (you have not shown how you used dist()). The default of Euclidean Distance assumes that your data follows a normal distribution; thus, if using the defaults and using RNA-seq data, you should be plotting logged or Z-scaled counts, or any other data that has a normal distribution. If you're using other counts, such as the negative binomial normalised counts prior to log transformation, then you could use 1 minus Pearson / Spearman correlation as your distance metric.

You have not mentioned heatmaps, but the heatmap functions from gplots package in R will calculate the row and column dendrograms from whatever data you supply to the function, but they then usually (default) perform a transformation that involves centering and then dividing by the standard deviation for the purposes of representing values in the heatmap itself. Here is the actual code from heatmap.2:

if (scale == "row") {
        retval$rowMeans <- rm <- rowMeans(x, na.rm = na.rm)
        x <- sweep(x, 1, rm)
        retval$rowSDs <- sx <- apply(x, 1, sd, na.rm = na.rm)
        x <- sweep(x, 1, sx, "/")
else if (scale == "column") {
        retval$colMeans <- rm <- colMeans(x, na.rm = na.rm)
        x <- sweep(x, 2, rm)
        retval$colSDs <- sx <- apply(x, 2, sd, na.rm = na.rm)
        x <- sweep(x, 2, sx, "/")

In fact, I have just seen a similar answer by my colleague Michael from all of 6 years ago: Scale Data Before Drawing Heatmap Or Using Heatmap(..., Scale="Columan") In R?


ADD COMMENTlink written 2.6 years ago by Kevin Blighe65k

Hi Kevin. I'm new to this. Sorry if this question is a bit basic. When referring to "normal distribution", it means "all genes' expression in one sample" or "one gene's expression in all samples"?

ADD REPLYlink written 2.5 years ago by niuyw30

Hello, by 'normal distribution', I mean this:



Logged and/or Z-scaled RNA-seq data should follow this distribution (but not always)


RNA-seq raw and nomalised data, however, follow the negative binomial distribution: d

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Kevin Blighe65k

Thank you! I'm looking over your answers about PCA. They are very helpful, thanks again!

ADD REPLYlink written 2.5 years ago by niuyw30
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1183 users visited in the last hour