Question

Using RNA-seq RPKM data for hierarchical clustering

2

10.0 years ago

simonhb1990 ▴ 20

Hi,

I currently have only RPKM data from an RNA-seq experiment, and I want to use it for hierarchical clustering.

My question is whether I need to log-transform the RPKM data before clustering, or whether I can calculate z-scores directly from the RPKM values and cluster those.

I ask because I think the purpose of the log transformation is to rescale ratios of change (fold changes), especially for microarray data. Since the data I have now are not ratios, I think the transformation may not be needed.

regards,
Simon

RNA-Seq • 4.5k views

Updated 3.8 years ago by Ram 45k • written 10.0 years ago by simonhb1990 ▴ 20

0

How are you clustering, and how are you measuring distance? The log transformation might have no effect, or it might be crucial, depending on the distance function.

After you cluster, check for batch effects; I'm curious how the experiment might influence the data.
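
To illustrate the point about the distance function, here is a minimal sketch in plain Python. The RPKM values are made up, and the log2(x + 1) transform and the "1 minus Spearman correlation" distance are assumptions chosen for illustration, not anything specified in this thread. A rank-based distance is unchanged by the log because the log preserves ordering, whereas Euclidean distance changes a lot.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ranks(values):
    """Ranks 1..n of the values (this sketch assumes there are no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman_distance(a, b):
    """1 - Spearman correlation, i.e. Pearson correlation of the ranks."""
    return 1.0 - pearson(ranks(a), ranks(b))

def log2p1(values):
    """log2(x + 1); the +1 pseudocount is an assumed convention."""
    return [math.log2(v + 1.0) for v in values]

# Hypothetical RPKM values for two samples across five genes.
sample_a = [0.5, 12.0, 150.0, 3.0, 2400.0]
sample_b = [1.5, 9.0, 90.0, 14.0, 3100.0]

raw_euclid = euclidean(sample_a, sample_b)
log_euclid = euclidean(log2p1(sample_a), log2p1(sample_b))
raw_spearman = spearman_distance(sample_a, sample_b)
log_spearman = spearman_distance(log2p1(sample_a), log2p1(sample_b))

print("Euclidean raw vs log:", raw_euclid, log_euclid)
print("Spearman  raw vs log:", raw_spearman, log_spearman)
```

In the raw data the Euclidean distance is almost entirely due to the single most highly expressed gene; after the log, every gene contributes on a comparable scale.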

Updated 3.8 years ago by Ram 45k • written 10.0 years ago by Asaf 10k

0

I used average-linkage hierarchical clustering with Euclidean distance in Cluster 3.0.

Updated 3.8 years ago by Ram 45k • written 10.0 years ago by simonhb1990 ▴ 20

Ram · Answer 1 · 2015-07-21

1

10.0 years ago

matt.newman ▴ 170

I still think you need to log-transform the data before clustering. The purpose of taking the log is to reduce the effect of outliers, and that applies here as well. I know that in our software (OncoLand) we do this automatically when clustering RPKM data for heatmaps.
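
As a rough sketch of what such a transform could look like (this is not OncoLand's implementation, and the function name and the pseudocount of 1 are assumptions for illustration), one common convention is log2(RPKM + 1). The pseudocount keeps genes with zero RPKM finite, and the log compresses the range so that a few very highly expressed genes no longer dominate.

```python
import math

def log2_rpkm(values, pseudocount=1.0):
    """Return log2(x + pseudocount) for each RPKM value.

    The pseudocount (assumed to be 1 here) avoids log2(0) for
    genes that have zero RPKM.
    """
    return [math.log2(v + pseudocount) for v in values]

# Hypothetical RPKM values for one sample, including a zero and an outlier.
rpkm = [0.0, 1.0, 3.0, 1023.0]
logged = log2_rpkm(rpkm)

raw_range = max(rpkm) - min(rpkm)      # spread before the transform
log_range = max(logged) - min(logged)  # spread after the transform

print(logged)
print(raw_range, log_range)
```

With these made-up numbers the spread shrinks from 1023 to 10, which is the sense in which the log tames outliers.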

Updated 3.8 years ago by Ram 45k • written 10.0 years ago by matt.newman ▴ 170

Ram · Answer 2 · 2015-07-21

The choice may depend on why you are clustering. I find that mean-centering the data and representing it as z-scores (also called scaling the data), as you mentioned, is a generally useful way to group genes with common behaviors. R has a function called scale() that does this. Beware, though, that it centers and scales columns by default, so if your genes are in rows you have to transpose the matrix with t() before calling it, e.g. t(scale(t(x))), where x is your expression matrix.
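
For readers not working in R, the row-wise z-scoring described above can be sketched in plain Python. The function name and the small matrix are hypothetical. Like scale(), the sketch uses the sample standard deviation (n - 1 denominator); unlike scale(), it returns zeros for a constant row rather than NaN, which is a choice made here for convenience.

```python
import statistics

def zscore_rows(matrix):
    """Z-score each row (gene) across its columns (samples)."""
    scaled = []
    for row in matrix:
        mean = statistics.mean(row)
        sd = statistics.stdev(row)  # sample SD, n - 1 denominator
        if sd > 0:
            scaled.append([(v - mean) / sd for v in row])
        else:
            # Constant row: no variation to scale, so return zeros.
            scaled.append([0.0 for _ in row])
    return scaled

# Hypothetical expression matrix: rows are genes, columns are samples.
expr = [
    [1.0, 2.0, 3.0],     # lowly expressed gene
    [10.0, 20.0, 30.0],  # same pattern at ten times the level
    [5.0, 5.0, 5.0],     # constant gene
]
z = zscore_rows(expr)

for row in z:
    print(row)
```

The first two genes end up with identical z-score profiles even though their absolute levels differ tenfold, which is why z-scoring groups genes by pattern rather than by magnitude.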