Question: Preparing RNA-seq data for Hierarchical Clustering
lrutter0306 (United States) wrote, 5 months ago:

I am clustering RNA-seq data into groups of similar expression patterns and visualizing the results. To this end, I have been:

1) Logging and normalizing the RNA-seq data with the following two methods (the first is from the edgeR package, the second from the EDASeq package):

<- cpm(data, TRUE, TRUE)
betweenLaneNormalization(, which="full", round=FALSE)

2) Standardizing each gene to have a mean=0 and standard deviation=1.

3) Performing hierarchical clustering (from the stats package) using ward.D linkage

hclust(d, method="ward.D")
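The three steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the count matrix is simulated, the log-CPM step is a base-R approximation rather than the edgeR or EDASeq calls, and the number of clusters (4) is arbitrary.

```r
# Simulated counts stand in for real data: 200 genes x 6 samples (hypothetical).
set.seed(1)
counts <- matrix(rnbinom(200 * 6, mu = 100, size = 2), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))

# Step 1: log2 counts-per-million with a small pseudocount.
# Base-R approximation only; edgeR::cpm(..., log = TRUE) applies its own prior count.
lib_size <- colSums(counts)
log_cpm <- log2(t(t(counts + 0.5) / (lib_size + 1)) * 1e6)

# Step 2: standardize each gene (row) to mean 0 and standard deviation 1.
# scale() works on columns, hence the two transposes.
z <- t(scale(t(log_cpm)))

# Step 3: Euclidean distances between genes, then ward.D linkage.
d <- dist(z, method = "euclidean")
hc <- hclust(d, method = "ward.D")

# Cut the tree into an arbitrary number of clusters and count their sizes.
clusters <- cutree(hc, k = 4)
table(clusters)
```

One detail worth knowing: according to the hclust documentation, "ward.D" only implements Ward's criterion when it is given squared distances, whereas "ward.D2" squares the dissimilarities internally.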

The resulting clusters look pretty clean when plotted. However, I am trying to determine whether this (normalizing, logging, and standardizing in this manner) is a recommended approach to hierarchical clustering of gene expression data. It is unclear to me whether the literature recommends a particular method for preparing RNA-seq data for hierarchical clustering.

Thank you for sharing any advice or information you may have.

Kevin Blighe (Republic of Ireland) wrote, 5 months ago:

Usually, yes: raw counts are normalised and logged, and some further transformation is sometimes applied prior to clustering. I'd be interested to see your data distribution after your step 2). You can plot the distribution with hist().
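As a rough illustration of that check, the values from step 2) can be flattened into a vector and passed to hist(). The matrix below is simulated and its name (log_expr) is hypothetical; substitute your own standardized matrix.

```r
set.seed(2)
# Hypothetical log-scale expression matrix: 500 genes x 8 samples.
log_expr <- matrix(rnorm(500 * 8, mean = 5, sd = 2), nrow = 500)

# Standardize each gene (row) to mean 0 and standard deviation 1.
z <- t(scale(t(log_expr)))

# Histogram of all standardized values; hist() also returns the bin counts.
h <- hist(as.vector(z), breaks = 50,
          main = "Z-scaled expression", xlab = "Z-score")

# Numeric summary to accompany the plot.
summary(as.vector(z))
```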

The actual differential expression statistical tests would have been performed on the normalised un-logged counts.

The type of data used for hierarchical clustering should match the distance metric chosen when calculating the distance matrix with dist() (you have not shown how you used dist()). The default, Euclidean distance, is best suited to data that is roughly normally distributed; thus, if using the defaults with RNA-seq data, you should be using logged or Z-scaled counts, or any other data that is approximately normal. If you're using other counts, such as normalised counts prior to log transformation (which follow a negative binomial distribution), then you could use 1 minus the Pearson or Spearman correlation as your distance metric.
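To make the two options concrete, here is a small sketch on simulated data. The matrix, its dimensions, and the "average" linkage are assumptions chosen for illustration. The last lines also check a known identity: once each gene is Z-scaled, the squared Euclidean distance between two genes equals 2(n - 1)(1 - r), where r is their Pearson correlation and n the number of samples, so the two metrics order gene pairs identically.

```r
set.seed(3)
# Hypothetical expression matrix: 50 genes x 10 samples.
expr <- matrix(rnorm(50 * 10), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("s", 1:10)))

# Euclidean distance between rows (genes); this is the dist() default.
d_euclid <- dist(expr)

# Correlation-based distances: cor() works on columns, so transpose first.
d_pearson  <- as.dist(1 - cor(t(expr), method = "pearson"))
d_spearman <- as.dist(1 - cor(t(expr), method = "spearman"))

# Either distance object can be passed to hclust(); the linkage is arbitrary here.
hc_pearson <- hclust(d_pearson, method = "average")

# Link between the two metrics after Z-scaling each gene.
z <- t(scale(t(expr)))
n <- ncol(expr)
identity_holds <- all.equal(as.vector(dist(z))^2,
                            as.vector(2 * (n - 1) * d_pearson))
identity_holds
```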

You have not mentioned heatmaps, but the heatmap functions in R, such as heatmap.2 from the gplots package, calculate the row and column dendrograms from whatever data you supply to the function. When scaling is requested through the scale argument (scale="row" is the default in heatmap() from the stats package, whereas heatmap.2 defaults to scale="none"), they then centre the values and divide by the standard deviation for the purposes of representing values in the heatmap itself. Here is the actual code from heatmap.2:

if (scale == "row") {
        retval$rowMeans <- rm <- rowMeans(x, na.rm = na.rm)
        x <- sweep(x, 1, rm)
        retval$rowSDs <- sx <- apply(x, 1, sd, na.rm = na.rm)
        x <- sweep(x, 1, sx, "/")
} else if (scale == "column") {
        retval$colMeans <- rm <- colMeans(x, na.rm = na.rm)
        x <- sweep(x, 2, rm)
        retval$colSDs <- sx <- apply(x, 2, sd, na.rm = na.rm)
        x <- sweep(x, 2, sx, "/")
}
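The scale == "row" branch is an ordinary per-row Z-score. A minimal sketch on a simulated matrix (the variable names here are mine, not from gplots) shows that it gives the same result as the common t(scale(t(x))) idiom:

```r
set.seed(4)
# Hypothetical matrix: 20 genes x 5 samples.
x <- matrix(rnorm(20 * 5, mean = 10, sd = 3), nrow = 20)

# Same steps as the scale == "row" branch: centre each row, then divide by its SD.
row_means <- rowMeans(x)
centered <- sweep(x, 1, row_means)
row_sds <- apply(centered, 1, sd)
scaled_sweep <- sweep(centered, 1, row_sds, "/")

# Equivalent result via scale(), which operates on columns, hence the transposes.
scaled_base <- t(scale(t(x)))

# Compare values only; scale() attaches extra attributes to its result.
same <- all.equal(scaled_sweep, scaled_base, check.attributes = FALSE)
same
```

So if the data were already standardized per gene in step 2), row scaling inside the heatmap function changes nothing of substance.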

In fact, I have just seen a similar answer by my colleague Michael from all of 6 years ago: Scale Data Before Drawing Heatmap Or Using Heatmap(..., Scale="Columan") In R?



Hi Kevin. I'm new to this, so sorry if this question is a bit basic. When you refer to "normal distribution", do you mean "the expression of all genes in one sample" or "the expression of one gene across all samples"?

written 4 months ago by niuyw

Hello, by 'normal distribution', I mean this:

[image: normal distribution]

Logged and/or Z-scaled RNA-seq data should usually (but not always) follow this distribution.

RNA-seq raw and normalised data, however, follow the negative binomial distribution:

[image: negative binomial distribution]
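The contrast between the two shapes can be reproduced with a small simulation. Everything here is an assumption made for illustration: gene-level means are drawn from a log-normal distribution, counts from a negative binomial with size = 10, and skewness is a simple moment-based estimate (0 for a perfectly symmetric distribution).

```r
set.seed(5)
# Assumption: mean expression varies across 2000 hypothetical genes (log-normal).
gene_means <- rlnorm(2000, meanlog = 5, sdlog = 1.5)

# One negative binomial count per gene, then a log2 transform with pseudocount 1.
raw <- rnbinom(2000, mu = gene_means, size = 10)
logged <- log2(raw + 1)

# Simple moment-based skewness estimate.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Raw counts typically show a long right tail; logged counts are far more symmetric.
c(raw = skewness(raw), logged = skewness(logged))

# Side-by-side histograms of the two scales.
op <- par(mfrow = c(1, 2))
hist(raw, breaks = 50, main = "Raw counts", xlab = "count")
hist(logged, breaks = 50, main = "log2(count + 1)", xlab = "log2 count")
par(op)
```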

written 4 months ago by Kevin Blighe

Thank you! I'm looking over your answers about PCA. They are very helpful, thanks again!

written 4 months ago by niuyw