Question: Z-score calculation from raw counts data
9 months ago by
Biologist150
Biologist150 wrote:

I have RNA-Seq data with raw counts in a data frame 'counts', with samples as columns and genes as rows. It looks something like this:

                           Sample1    Sample2       Sample3    Sample4       Sample5
ENSG00000243485.5            0            0            0            1            0
ENSG00000237613.2            0            0            0            0            0
ENSG00000238009.6           10            5            6           16           13
ENSG00000239945.1            0            0            0            0            0
ENSG00000239906.1            1            0            0            9            0
ENSG00000241860.6           16           65           19           56           82

And the sample annotation looks like below:

Samples      N_stage          M_stage
Sample1         N0               M1
Sample2         N1               M0
Sample3         N0               MX
Sample4         N2               M0
Sample5         N3               MX

I want to make a heatmap that shows the sample stages. So, first I converted the counts to logCPM.

library(edgeR)  # provides cpm()
logCPM <- cpm(counts, prior.count=2, log=TRUE)

logCPM looks like below:

                      Sample1      Sample2      Sample3      Sample4      Sample5
ENSG00000243485.5    0.3915381    0.3915381    0.3915381    1.0122705    0.3915381
ENSG00000237613.2    0.3915381    0.3915381    0.3915381    0.3915381    0.3915381
ENSG00000238009.6    2.7079614    2.1523650    2.3686665    3.6549466    3.5064449
ENSG00000239945.1    0.3915381    0.3915381    0.3915381    0.3915381    0.3915381
ENSG00000239906.1    0.8750013    0.3915381    0.3915381    2.9372348    0.3915381
ENSG00000241860.6    3.2731113    5.3940606    3.7562189    5.3507849    6.0161469

Then I converted logCPM to Z-scores:

Z.score <- t(scale(t(logCPM)))

                     Sample1       Sample2      Sample3      Sample4      Sample5
ENSG00000243485.5  -0.34352108  -0.34352108  -0.34352108   2.46775178  -0.34352108
ENSG00000237613.2  -0.07930516  -0.07930516  -0.07930516  -0.07930516  -0.07930516
ENSG00000238009.6  -0.40081457  -1.01397031  -0.77526011   0.64427767   0.48039136
ENSG00000239945.1   0.19393322   0.19393322   0.19393322   0.19393322   0.19393322
ENSG00000239906.1   0.11017422  -0.78784826  -0.78784826   3.94072835  -0.78784826
ENSG00000241860.6  -1.35704877   0.13799168  -1.01650998   0.10748697   0.57649536
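What the double transpose computes can be checked by hand. The following is a minimal sketch on a small toy matrix (the values and the names geneA and geneB are hypothetical, not taken from the real data): each gene's mean is subtracted and the result is divided by that gene's standard deviation, which should agree with t(scale(t(...))).

```r
# Toy logCPM-like matrix (hypothetical values): genes in rows, samples in columns.
m <- matrix(c(2.7, 2.2, 2.4, 3.7, 3.5,
              3.3, 5.4, 3.8, 5.4, 6.0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("geneA", "geneB"), paste0("Sample", 1:5)))

# Row-wise Z-score by hand: subtract each gene's mean, divide by its SD.
# A vector of length nrow(m) is recycled down the columns, so row i uses value i.
z_manual <- (m - rowMeans(m)) / apply(m, 1, sd)

# The same thing via scale(), which works on columns, hence the two transposes.
z_scale <- t(scale(t(m)))

# Compare values only; scale() attaches extra attributes to its result.
all.equal(z_manual, z_scale, check.attributes = FALSE)
```

A gene whose values are identical in every sample has a standard deviation of 0, so its row-wise Z-score is undefined (NaN) under this calculation.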

Next I calculated the IQR and selected the top 10% of genes for plotting. Is the above Z-score calculation right? I am a little confused. How can the gene ENSG00000238009.6 have negative values in the first three samples and positive values in Sample4 and Sample5? For the same gene, logCPM is greater than 2 in all samples. Why is the Z-score negative?
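The IQR filtering step is not shown in the post, so the following is only a sketch of one plausible reading: the IQR is computed per gene on the logCPM values (an assumption, since after row-wise scaling every gene has the same variance), and genes at or above the 90th percentile are kept. The matrix logCPM_toy and all names below are hypothetical stand-ins for the real data.

```r
set.seed(1)  # reproducible toy data

# Hypothetical stand-in for the real logCPM matrix: 50 genes x 6 samples.
logCPM_toy <- matrix(rnorm(300, mean = 3), nrow = 50,
                     dimnames = list(paste0("gene", 1:50),
                                     paste0("Sample", 1:6)))

# Per-gene interquartile range across samples.
gene_iqr <- apply(logCPM_toy, 1, IQR)

# Keep genes whose IQR is at or above the 90th percentile (the top 10%).
keep <- gene_iqr >= quantile(gene_iqr, 0.9)
top <- logCPM_toy[keep, , drop = FALSE]

# Row-wise Z-scores of the selected genes, as input for a heatmap.
z_top <- t(scale(t(top)))
dim(z_top)
```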

zscore rna-seq logcpm heatmap • 560 views
modified 9 months ago by Michael Dondrup46k • written 9 months ago by Biologist150
9 months ago by
Bergen, Norway
Michael Dondrup46k wrote:

scale is a generic function whose default method centers and/or scales the columns of a numeric matrix.

Now, if your columns are your samples, transposing before and after scale() makes you scale rows instead of columns. So not:

t(scale(t(logCPM)))

A Z-score is, afaik, computed per column, i.e. per sample, scaling with respect to the overall variation within each sample. Therefore

Z.score <- scale(logCPM)

should be correct.

How the gene ENSG00000238009.6 has negative values in the first three samples and positive values in the sample 4 and sample 5?

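You scaled by row, which shifts each row mean to 0. If some values are > 0, then others have to be < 0 so that the row sums to 0. A negative Z-score only means the value lies below that gene's mean, not that the logCPM itself is negative.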
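This can be checked with the five logCPM values of ENSG00000238009.6 shown in the question. It is a sketch only: the resulting Z-scores will not match the table in the question if the full matrix contains more samples than the five displayed, because the gene's mean and SD would then differ.

```r
# logCPM values of ENSG00000238009.6 in the five samples shown in the question.
x <- c(Sample1 = 2.7079614, Sample2 = 2.1523650, Sample3 = 2.3686665,
       Sample4 = 3.6549466, Sample5 = 3.5064449)

# Z-score within this gene: distance from the gene's own mean, in SD units.
z <- (x - mean(x)) / sd(x)

all(x > 2)            # every logCPM is above 2
sign(z)               # samples below the gene's mean still get a negative Z-score
abs(sum(z)) < 1e-12   # the Z-scores of one gene sum to (numerically) zero
```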

modified 9 months ago • written 9 months ago by Michael Dondrup46k

But in the edgeR tutorial I see t(scale(t(logCPM))) used to show the expression data in a heatmap. And in this post [C: Adding column annotation to heatmap using Complexheatmap], Kevin scaled the data to Z-scores the same way.

written 9 months ago by Biologist150

That may be; it just depends on whether you want to scale columns, rows, or both. Either way you get a Z-score: one is per row (gene), the other per column (sample). Both normalizations make sense in some settings. Our confusion is rather about which scaling is commonly referred to as "Z-score", but indeed both are.
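The two scalings can be put side by side. A minimal sketch on a toy matrix (values and names are hypothetical): scale() alone standardizes each sample, while the double transpose standardizes each gene.

```r
# Toy matrix (hypothetical values): 4 genes in rows, 3 samples in columns.
m <- matrix(c(1, 5, 9, 2,
              4, 6, 8, 3,
              7, 2, 5, 9),
            nrow = 4,
            dimnames = list(paste0("gene", 1:4), paste0("Sample", 1:3)))

z_by_sample <- scale(m)        # per column: each sample has mean 0, SD 1
z_by_gene   <- t(scale(t(m)))  # per row: each gene has mean 0, SD 1

round(colMeans(z_by_sample), 10)     # zeros, one per sample
round(rowMeans(z_by_gene), 10)       # zeros, one per gene
round(apply(z_by_gene, 1, sd), 10)   # ones, one per gene
```

For a heatmap that compares samples gene by gene, the per-gene version puts every row on a common scale, which matches the usage in the edgeR tutorial mentioned in this thread.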

modified 9 months ago • written 9 months ago by Michael Dondrup46k

Thanks a lot for the reply.

written 9 months ago by Biologist150