Question

RNA-seq z-score normalization prior to clustering

2

7.7 years ago

acorella ▴ 30

Hi,

I would like to perform unsupervised hierarchical clustering on some RNA-seq data, but I was told I need to normalize the data by z-score per gene.

My question is: what type of RNA-seq data should z-score normalization be performed on? Is it better to normalize RPKM, CPM, log2 CPM, etc.?

I typically represent my RNA-seq data as mean-centered log2 CPM. Can I perform z-score normalization per gene on mean-centered log2 CPM, or is this not advised?

Thanks!

RNA-Seq normalization • 13k views

updated 7.7 years ago by Ron ★ 1.2k • written 7.7 years ago by acorella ▴ 30

0

@acorella Can you define the z-score? Is it the normalisation in which, for example, a vector of expression values is centered to have mean 0 and scaled to have standard deviation 1? After checking, I came across this post, which I believe answers your question: TCGA: What are mRNA expression z-scores? Does TCGA have mRNA expression from controls?

7.7 years ago by Mo ▴ 920

0

Hi, yes, that is the z-score I am referring to. However, that post does not answer my question.

My question is: is it appropriate to z-score normalize mean-centered log2 CPM values? Does it matter which type of RNA-seq values (CPM, RPKM, TPM, etc.) I use for the z-score normalization?

7.7 years ago by acorella ▴ 30

0

What is the structure of your data, e.g. #samples, #conditions, #replicates/condition? And are you attempting to find clusters of samples? Or genes? Depending on what you are interested in, using z-scores may not be necessary.

7.7 years ago by keith.hughitt ▴ 280

score 5 · Answer 1 · 2016-08-18

Hi Acorella,

The z-score normalisation only really makes sense if the expression values for a given gene are (approximately) normally distributed. One would expect RPKM to be approximately log-normally distributed and CPM to be approximately negative binomially distributed, assuming CPM = counts per million. If you want to work from RPKM or CPM, I'd suggest using log RPKM; I'm not sure what to expect from log CPM.
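As a rough illustration of the per-gene z-score being discussed, here is a minimal Python sketch using NumPy. The matrix `log2_cpm` and the function name `zscore_per_gene` are made up for the example, and the sample standard deviation (`ddof=1`) is an assumption; it also shows that mean-centering the values beforehand does not change the resulting z-scores, since the z-score subtracts the per-gene mean anyway.

```python
import numpy as np

def zscore_per_gene(log_expr):
    """Row-wise z-score: each gene (row) ends up with mean 0 and SD 1."""
    log_expr = np.asarray(log_expr, dtype=float)
    mean = log_expr.mean(axis=1, keepdims=True)
    sd = log_expr.std(axis=1, ddof=1, keepdims=True)
    # Genes with (near) zero variance cannot be scaled; dividing by 1
    # leaves them at 0 instead of producing NaN or inf.
    sd[np.isclose(sd, 0.0)] = 1.0
    return (log_expr - mean) / sd

# Hypothetical log2 CPM values: 3 genes (rows) x 4 samples (columns).
log2_cpm = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 5.5, 6.5, 7.0],
    [2.0, 2.0, 2.0, 2.0],  # constant gene
])

z = zscore_per_gene(log2_cpm)

# Mean-centering first, then z-scoring, gives the same result.
centered = log2_cpm - log2_cpm.mean(axis=1, keepdims=True)
z_from_centered = zscore_per_gene(centered)
```

In this toy example `z` and `z_from_centered` are identical, so z-scoring already mean-centered log values is harmless but redundant.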

However, I think you'd be much better off using transcripts per million (TPM) as your unit of expression (see Question: the problem with rpkm (and tpm), and What the FPKM? A review of RNA-Seq expression units). The second link also explains how to convert from RPKM/FPKM to TPM. log(TPM) will be approximately normally distributed and suitable for calculating z-scores.
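A minimal sketch of that route, assuming a genes x samples RPKM matrix (the values below are invented): TPM is obtained by rescaling each sample's RPKM values so that they sum to one million, and the z-score is then taken per gene on the log scale. The pseudocount of 1 inside the log and the use of log2 are assumptions for the example, not something the linked posts prescribe.

```python
import numpy as np

def rpkm_to_tpm(rpkm):
    """Rescale each sample (column) so that its values sum to one million."""
    rpkm = np.asarray(rpkm, dtype=float)
    return rpkm / rpkm.sum(axis=0, keepdims=True) * 1e6

# Hypothetical RPKM matrix: 3 genes (rows) x 3 samples (columns).
rpkm = np.array([
    [10.0,  20.0, 30.0],
    [30.0,  20.0, 45.0],
    [60.0, 160.0, 75.0],
])

tpm = rpkm_to_tpm(rpkm)

# Log-transform with a pseudocount (an assumption) to avoid log(0).
log_tpm = np.log2(tpm + 1.0)

# Per-gene z-score on the log scale (sample SD, ddof=1).
mean = log_tpm.mean(axis=1, keepdims=True)
sd = log_tpm.std(axis=1, ddof=1, keepdims=True)
z = (log_tpm - mean) / sd
```

Real data would also need a guard for genes with zero variance across samples, which this toy matrix does not contain.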

score 0 · Answer 2 · 2016-08-22

Log2 CPM can be used for unsupervised clustering. This should work.
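A minimal sketch of what that could look like in Python, assuming NumPy and SciPy are available. The count matrix is invented, and the pseudocount of 1, the per-gene z-scoring step, average linkage, and Euclidean distance are all choices made for the example rather than recommendations from this thread.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical raw counts: 4 genes (rows) x 4 samples (columns).
# Samples 0-1 and samples 2-3 stand in for two made-up conditions.
counts = np.array([
    [100, 120,  10,  12],
    [ 50,  55, 400, 380],
    [300, 310, 290, 305],
    [  5,   8,  90, 100],
], dtype=float)

# Counts per million, then log2 with a pseudocount (an assumption).
lib_size = counts.sum(axis=0, keepdims=True)
log2_cpm = np.log2(counts / lib_size * 1e6 + 1.0)

# Per-gene z-score so that every gene contributes on the same scale.
mean = log2_cpm.mean(axis=1, keepdims=True)
sd = log2_cpm.std(axis=1, ddof=1, keepdims=True)
z = (log2_cpm - mean) / sd

# Cluster samples: transpose so that each row is one sample.
tree = linkage(z.T, method="average", metric="euclidean")
labels = fcluster(tree, t=2, criterion="maxclust")
```

To cluster genes instead of samples, pass `z` rather than `z.T` to `linkage`.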

7.7 years ago by Ron ★ 1.2k