Question

Z-score calculation from raw counts data

3

5.6 years ago

Biologist ▴ 290

I have RNA-Seq data with raw counts in dataframe 'counts' having samples as columns and genes as rows. It looks something like below:

                           Sample1    Sample2       Sample3    Sample4       Sample5
ENSG00000243485.5            0            0            0            1            0
ENSG00000237613.2            0            0            0            0            0
ENSG00000238009.6           10            5            6           16           13
ENSG00000239945.1            0            0            0            0            0
ENSG00000239906.1            1            0            0            9            0
ENSG00000241860.6           16           65           19           56           82

And the sample annotation looks like below:

Samples      N_stage          M_stage
Sample1         N0               M1
Sample2         N1               M0
Sample3         N0               MX
Sample4         N2               M0
Sample5         N3               MX

I wanted to make a heatmap to show the sample stages. So, first I converted counts to logCPM.

library(edgeR)  # provides cpm()
logCPM <- cpm(counts, prior.count=2, log=TRUE)

logCPM looks like below:

                      Sample1      Sample2      Sample3      Sample4      Sample5
ENSG00000243485.5    0.3915381    0.3915381    0.3915381    1.0122705    0.3915381
ENSG00000237613.2    0.3915381    0.3915381    0.3915381    0.3915381    0.3915381
ENSG00000238009.6    2.7079614    2.1523650    2.3686665    3.6549466    3.5064449
ENSG00000239945.1    0.3915381    0.3915381    0.3915381    0.3915381    0.3915381
ENSG00000239906.1    0.8750013    0.3915381    0.3915381    2.9372348    0.3915381
ENSG00000241860.6    3.2731113    5.3940606    3.7562189    5.3507849    6.0161469
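
For readers wondering why genes with zero counts end up with the same positive logCPM (0.3915381 above) in every sample, here is a minimal base-R sketch of what, as far as I understand it, cpm(..., log=TRUE) computes: the prior count is rescaled by each sample's library size before taking log2. The function name log_cpm_sketch and the toy counts are made up for illustration; treat this as an approximation and not as edgeR's actual implementation.

```r
# Toy counts: 3 genes x 3 samples (made-up numbers, not the poster's data).
# matrix() fills column by column, so each group of three is one sample.
counts_toy <- matrix(c(0, 10,  990,    # S1, library size 1000
                       0,  5, 1995,    # S2, library size 2000
                       1, 16, 2983),   # S3, library size 3000
                     nrow = 3,
                     dimnames = list(c("geneA", "geneB", "geneC"),
                                     c("S1", "S2", "S3")))

# Hypothetical helper: an approximation of log2-CPM with a prior count.
log_cpm_sketch <- function(x, prior.count = 2) {
  lib.size <- colSums(x)
  # Scale the prior so larger libraries get a proportionally larger prior
  prior.scaled <- prior.count * lib.size / mean(lib.size)
  adj.lib <- lib.size + 2 * prior.scaled
  # t(x) puts samples in rows so the per-sample vectors recycle correctly
  log2(t((t(x) + prior.scaled) / adj.lib) * 1e6)
}

logCPM_toy <- log_cpm_sketch(counts_toy)
logCPM_toy
```

Because the prior is proportional to library size, a zero count maps to the same finite value in S1 and S2 even though their library sizes differ, which matches the pattern in the table above.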

Then I converted logCPM to Z-scores.

Z.score <- t(scale(t(logCPM)))

                     Sample1       Sample2      Sample3      Sample4      Sample5
ENSG00000243485.5  -0.34352108  -0.34352108  -0.34352108   2.46775178  -0.34352108
ENSG00000237613.2  -0.07930516  -0.07930516  -0.07930516  -0.07930516  -0.07930516
ENSG00000238009.6  -0.40081457  -1.01397031  -0.77526011   0.64427767   0.48039136
ENSG00000239945.1   0.19393322   0.19393322   0.19393322   0.19393322   0.19393322
ENSG00000239906.1   0.11017422  -0.78784826  -0.78784826   3.94072835  -0.78784826
ENSG00000241860.6  -1.35704877   0.13799168  -1.01650998   0.10748697   0.57649536

Next I calculated the IQR and selected the top 10% of genes for plotting. Is the above Z-score calculation right? I am a little confused. Why does the gene ENSG00000238009.6 have negative values in the first three samples and positive values in Sample4 and Sample5? For the same gene, logCPM is greater than 2 in all samples. Why is it negative as a Z-score?
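
For reference, the IQR filtering step mentioned above could be sketched as follows. This assumes the IQR is computed per gene on the logCPM values and that "top 10%" means the genes with the largest IQR; the post does not spell out either detail. The matrix logCPM_toy is random stand-in data, not the real logCPM.

```r
set.seed(1)

# Stand-in for logCPM: 100 genes x 6 samples of random values
logCPM_toy <- matrix(rnorm(100 * 6), nrow = 100,
                     dimnames = list(paste0("gene", 1:100),
                                     paste0("Sample", 1:6)))

# Interquartile range of each gene (row) across samples
gene_iqr <- apply(logCPM_toy, 1, IQR)

# Keep the 10% of genes with the largest IQR
n_top <- ceiling(0.10 * length(gene_iqr))
top_genes <- names(sort(gene_iqr, decreasing = TRUE))[seq_len(n_top)]

logCPM_top <- logCPM_toy[top_genes, , drop = FALSE]
dim(logCPM_top)
```

The filtered matrix can then be scaled and passed to a heatmap function.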

RNA-Seq heatmap zscore logcpm • 6.1k views

ADD COMMENT • link updated 5.6 years ago by Michael 54k • written 5.6 years ago by Biologist ▴ 290

score 1 · Answer 1 · 2018-10-01

1

5.6 years ago

Michael 54k

scale is a generic function whose default method centers and/or scales the columns of a numeric matrix.

Your columns are your samples, so this call scales the rows (genes) instead of the columns (samples). So no, not:

t(scale(t(logCPM)))

As far as I know, a Z-score is computed per column, i.e. per sample, scaling with respect to the overall variation within each sample. So this should be correct:

Z.score <- scale(logCPM)
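
To make the difference between the two calls concrete, here is a small sketch on a made-up 3 x 3 matrix (the values and the names m, z_by_sample and z_by_gene are only for illustration). scale(m) standardizes each column, while t(scale(t(m))) standardizes each row.

```r
# Made-up matrix: genes in rows, samples in columns
m <- matrix(c(1, 2,  4,
              3, 9,  5,
              8, 7, 15),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("geneA", "geneB", "geneC"),
                            c("S1", "S2", "S3")))

z_by_sample <- scale(m)        # each column (sample): mean 0, sd 1
z_by_gene   <- t(scale(t(m)))  # each row (gene): mean 0, sd 1

colMeans(z_by_sample)  # numerically zero for every sample
rowMeans(z_by_gene)    # numerically zero for every gene
```

Which dimension ends up with mean 0 is the only difference between the two results, so the choice depends on whether you want to compare genes within a sample or samples within a gene.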

How the gene ENSG00000238009.6 has negative values in the first three samples and positive values in the sample 4 and sample 5?

You scaled by row, which centers each row so that its mean is 0. If some values are > 0, then others have to be < 0 for the row to sum to 0.
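
As a quick check, here is the same arithmetic done by hand on the five logCPM values shown in the question for ENSG00000238009.6. The magnitudes need not match the posted Z-scores, which may have been computed over more samples than the five displayed (that is an assumption on my part), but the signs come out the same.

```r
# logCPM values for ENSG00000238009.6, copied from the question
x <- c(Sample1 = 2.7079614, Sample2 = 2.1523650, Sample3 = 2.3686665,
       Sample4 = 3.6549466, Sample5 = 3.5064449)

mean(x)  # about 2.88, so values below that become negative

# Row-wise Z-score: subtract the gene's mean, divide by its sd
z <- (x - mean(x)) / sd(x)

sign(z)  # negative for Sample1 to Sample3, positive for Sample4 and Sample5
```

So a negative Z-score only means "below this gene's own average", not that the logCPM itself is low.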

ADD COMMENT • link 5.6 years ago by Michael 54k

0

But in the edgeR tutorial I see t(scale(t(logCPM))) used to show the expression data in a heatmap. Also, in this post [C: Adding column annotation to heatmap using Complexheatmap], Kevin scaled the data to Z-scores the same way.

ADD REPLY • link 5.6 years ago by Biologist ▴ 290

1

That may be; it just depends on whether you want to scale columns, rows, or both. Either way you get a Z-score: one is per row, the other per column. Both normalizations make sense in some settings. Our confusion is rather about which scaling is commonly called a "Z-score", and in fact both are.