Question

Preparing RNA-seq data for Hierarchical Clustering

1

Entering edit mode

6.2 years ago

LRStar ▴ 200

I am clustering RNA-seq data into groups of similar expression patterns and visualizing the results. To this end, I have been:

1) Logging and normalizing the RNA-seq data with the following two methods (first is from edgeR package, second is from EDASeq package)

cpm.data.new <- cpm(data, TRUE, TRUE)
betweenLaneNormalization(cpm.data.new, which="full", round=FALSE)

2) Standardizing each gene to have a mean=0 and standard deviation=1.

3) Performing hierarchical clustering (from the stats package) using ward.D linkage

hclust(d, method="ward.D")

The resulting clusters look pretty clean when plotted. However, I was trying to determine if this is a recommended approach to hierarchical of clustering gene expression (to normalize, log, and standardize in this manner)? It is unclear to me if the literature has a recommended method for preparing RNA-seq for hierarchical clustering?

Thank you for sharing any advice or information you may have.

RNA-Seq • 7.2k views

ADD COMMENT • link updated 6.2 years ago by Kevin Blighe 87k • written 6.2 years ago by LRStar ▴ 200

score 4 · Answer 1 · 2018-02-10

Usually, yes, raw counts are normalised, logged, and then some other transformation can sometimes occur prior to clustering. I'd be interested to see your data distribution after your step 2). You can plot the distribution with hist()

The actual differential expression statistical tests would have been performed on the normalised un-logged counts.

The type of data that is used for hierarchical clustering should match the chosen distance metric when calculating the distance matrix using dist() (you have not shown how you used dist()). The default of Euclidean Distance assumes that your data follows a normal distribution; thus, if using the defaults and using RNA-seq data, you should be plotting logged or Z-scaled counts, or any other data that has a normal distribution. If you're using other counts, such as the negative binomial normalised counts prior to log transformation, then you could use 1 minus Pearson / Spearman correlation as your distance metric.

You have not mentioned heatmaps, but the heatmap functions from gplots package in R will calculate the row and column dendrograms from whatever data you supply to the function, but they then usually (default) perform a transformation that involves centering and then dividing by the standard deviation for the purposes of representing values in the heatmap itself. Here is the actual code from heatmap.2:

if (scale == "row") {
        retval$rowMeans <- rm <- rowMeans(x, na.rm = na.rm)
        x <- sweep(x, 1, rm)
        retval$rowSDs <- sx <- apply(x, 1, sd, na.rm = na.rm)
        x <- sweep(x, 1, sx, "/")
}
else if (scale == "column") {
        retval$colMeans <- rm <- colMeans(x, na.rm = na.rm)
        x <- sweep(x, 2, rm)
        retval$colSDs <- sx <- apply(x, 2, sd, na.rm = na.rm)
        x <- sweep(x, 2, sx, "/")
}

In fact, I have just seen a similar answer by my colleague Michael from all of 6 years ago: Scale Data Before Drawing Heatmap Or Using Heatmap(..., Scale="Columan") In R?

Kevin

-------------------------------------