Question: RNA-seq z-score normalization prior to clustering
2.7 years ago by
United States
acorella30 wrote:


I would like to perform unsupervised hierarchical clustering on some RNA-seq data, but I was told I need to normalize the data by z-score per gene.

My question is: what type of RNAseq data should z-score normalization be performed on? Is it better to do the normalization on RPKM, CPM, log2 CPM, etc?

I typically represent my RNAseq data as mean-centered log2 CPM: Can I perform z-score normalization per gene on mean-centered log2 CPM? Or is this not advised?


rna-seq normalization • 4.0k views
modified 2.7 years ago by Ron920 • written 2.7 years ago by acorella30

@acorella Can you define the z-score? Is it the z-score normalisation in which a vector of expression values is centered to have mean 0 and scaled to have standard deviation 1? After checking, I came across this post, which I believe answers your question: "TCGA: What are mRNA expression z-scores? Does TCGA have mRNA expression from controls?"
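To make the definition above concrete, here is a minimal sketch of a per-gene z-score in plain Python. The function name `zscore` and the example values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption; the thread does not say which denominator is intended.

```python
from statistics import mean, stdev


def zscore(values):
    """Center one gene's expression values to mean 0 and scale to SD 1.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    m = mean(values)
    s = stdev(values)
    if s == 0:
        # A gene with no variation across samples carries no information
        # for clustering; return zeros rather than dividing by zero.
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


# Hypothetical expression values for one gene across three samples.
gene = [2.0, 4.0, 6.0]
z = zscore(gene)
print(z)
```

Applied gene by gene (row by row of a genes x samples matrix), this puts every gene on the same scale, so highly expressed genes do not dominate the distances used for clustering.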

written 2.7 years ago by Mo880

Hi, yes that is the z-score I am referring to, however that post does not answer my question.

My question is, is it appropriate to z-score normalize mean-centered log2 CPM values? Does it matter what type of RNAseq values (CPM, RPKM, TPM, etc) I use to z-score normalize?
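One way to reason about both parts of this question is to check them numerically. The sketch below (hypothetical values, a hypothetical `zscore` helper, and the sample standard deviation as an assumption) illustrates two properties of the z-score: it already subtracts the mean, so z-scoring mean-centered log2 CPM gives the same result as z-scoring log2 CPM directly; and it is not invariant to a log transform, so z-scores of CPM and of log2 CPM differ.

```python
from statistics import mean, stdev


def zscore(values):
    """Per-gene z-score using the sample SD (an assumption)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


def all_close(a, b, tol=1e-9):
    """True if two equal-length lists agree element-wise within tol."""
    return all(abs(x - y) < tol for x, y in zip(a, b))


# Hypothetical log2 CPM values for one gene across four samples.
log2_cpm = [3.0, 5.0, 4.0, 8.0]

# Mean-centering first does not change the z-score.
gene_mean = mean(log2_cpm)
centered = [v - gene_mean for v in log2_cpm]
same_after_centering = all_close(zscore(log2_cpm), zscore(centered))

# The unlogged CPM values give different z-scores.
cpm = [2.0 ** v for v in log2_cpm]
same_without_log = all_close(zscore(log2_cpm), zscore(cpm))

print(same_after_centering, same_without_log)
```

So the choice between raw and mean-centered input does not matter for the z-score itself, whereas the choice between logged and unlogged values does.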

modified 2.7 years ago • written 2.7 years ago by acorella30

What is the structure of your data, e.g. #samples, #conditions, #replicates/condition? And are you attempting to find clusters of samples? Or genes? Depending on what you are interested in, using z-scores may not be necessary.

written 2.7 years ago by keith.hughitt250
2.7 years ago by
United Kingdom
thomas.smith290 wrote:

Hi Acorella,

The Z-score normalisation only really makes sense if the expression values for a given gene are (approximately) normally distributed. One would expect RPKM to be approximately log-normally distributed and CPM to be approximately negative binomially distributed, assuming CPM = counts per million. If you wanted to work from RPKM or CPM, I'd suggest using the log RPKMs; I'm not sure what to expect from log CPMs.

However, I think you'd be much better off using transcripts per million (TPM) as your unit of expression (see Question: the problem with rpkm (and tpm), and What the FPKM? A review of RNA-Seq expression units). The second link also explains how to convert from RPKM/FPKM to TPM. log(TPM) will be approximately normally distributed and suitable for calculating z-scores.
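As a rough illustration of the RPKM/FPKM to TPM conversion mentioned above, the sketch below uses the usual rescaling in which each gene's RPKM is divided by the sum of RPKMs in the same sample and multiplied by one million. The function name `rpkm_to_tpm`, the example values, and the pseudocount of 1 added before taking logs are all assumptions made for illustration, not something specified in this thread.

```python
import math


def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM values so that they sum to one million."""
    total = sum(rpkm)
    if total == 0:
        raise ValueError("sample has no expression; cannot rescale")
    return [r / total * 1e6 for r in rpkm]


# Hypothetical RPKM values for three genes in a single sample.
sample_rpkm = [10.0, 30.0, 60.0]
tpm = rpkm_to_tpm(sample_rpkm)

# Assumption: a pseudocount of 1 avoids log2(0) for unexpressed genes.
log_tpm = [math.log2(t + 1.0) for t in tpm]

print(tpm)
print(log_tpm)
```

The conversion is done per sample (per column); the z-score is then computed per gene (per row) on the log-transformed values.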

written 2.7 years ago by thomas.smith290
2.7 years ago by
United States
Ron920 wrote:

Log2 CPM can be used for unsupervised clustering. This should work.

written 2.7 years ago by Ron920