Preparing RNA-seq data for Hierarchical Clustering
1
1
Entering edit mode
6.2 years ago
LRStar ▴ 200

I am clustering RNA-seq data into groups of similar expression patterns and visualizing the results. To this end, I have been:

1) Logging and normalizing the RNA-seq data with the following two methods (first is from edgeR package, second is from EDASeq package)

cpm.data.new <- cpm(data, TRUE, TRUE)
betweenLaneNormalization(cpm.data.new, which="full", round=FALSE)

2) Standardizing each gene to have a mean=0 and standard deviation=1.

3) Performing hierarchical clustering (from the stats package) using ward.D linkage

hclust(d, method="ward.D")

The resulting clusters look pretty clean when plotted. However, I was trying to determine if this is a recommended approach to hierarchical of clustering gene expression (to normalize, log, and standardize in this manner)? It is unclear to me if the literature has a recommended method for preparing RNA-seq for hierarchical clustering?

Thank you for sharing any advice or information you may have.

RNA-Seq • 7.2k views
ADD COMMENT
4
Entering edit mode
6.2 years ago

Usually, yes, raw counts are normalised, logged, and then some other transformation can sometimes occur prior to clustering. I'd be interested to see your data distribution after your step 2). You can plot the distribution with hist()

The actual differential expression statistical tests would have been performed on the normalised un-logged counts.

The type of data that is used for hierarchical clustering should match the chosen distance metric when calculating the distance matrix using dist() (you have not shown how you used dist()). The default of Euclidean Distance assumes that your data follows a normal distribution; thus, if using the defaults and using RNA-seq data, you should be plotting logged or Z-scaled counts, or any other data that has a normal distribution. If you're using other counts, such as the negative binomial normalised counts prior to log transformation, then you could use 1 minus Pearson / Spearman correlation as your distance metric.

You have not mentioned heatmaps, but the heatmap functions from gplots package in R will calculate the row and column dendrograms from whatever data you supply to the function, but they then usually (default) perform a transformation that involves centering and then dividing by the standard deviation for the purposes of representing values in the heatmap itself. Here is the actual code from heatmap.2:

if (scale == "row") {
        retval$rowMeans <- rm <- rowMeans(x, na.rm = na.rm)
        x <- sweep(x, 1, rm)
        retval$rowSDs <- sx <- apply(x, 1, sd, na.rm = na.rm)
        x <- sweep(x, 1, sx, "/")
}
else if (scale == "column") {
        retval$colMeans <- rm <- colMeans(x, na.rm = na.rm)
        x <- sweep(x, 2, rm)
        retval$colSDs <- sx <- apply(x, 2, sd, na.rm = na.rm)
        x <- sweep(x, 2, sx, "/")
}

In fact, I have just seen a similar answer by my colleague Michael from all of 6 years ago: Scale Data Before Drawing Heatmap Or Using Heatmap(..., Scale="Columan") In R?

Kevin

ADD COMMENT
0
Entering edit mode

Hi Kevin. I'm new to this. Sorry if this question is a bit basic. When referring to "normal distribution", it means "all genes' expression in one sample" or "one gene's expression in all samples"?

ADD REPLY
0
Entering edit mode

Hello, by 'normal distribution', I mean this:

n

[souce: https://www.mathsisfun.com/data/standard-normal-distribution.html]

Logged and/or Z-scaled RNA-seq data should follow this distribution (but not always)

-------------------------------------

RNA-seq raw and nomalised data, however, follow the negative binomial distribution: d

ADD REPLY
1
Entering edit mode

Thank you! I'm looking over your answers about PCA. They are very helpful, thanks again!

ADD REPLY

Login before adding your answer.

Traffic: 2908 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6