I have RNA-Seq raw counts in a data frame 'counts', with samples as columns and genes as rows. It looks something like this:
Sample1 Sample2 Sample3 Sample4 Sample5
ENSG00000243485.5 0 0 0 1 0
ENSG00000237613.2 0 0 0 0 0
ENSG00000238009.6 10 5 6 16 13
ENSG00000239945.1 0 0 0 0 0
ENSG00000239906.1 1 0 0 9 0
ENSG00000241860.6 16 65 19 56 82
And the sample annotation looks like below:
Samples N_stage M_stage
Sample1 N0 M1
Sample2 N1 M0
Sample3 N0 MX
Sample4 N2 M0
Sample5 N3 MX
I wanted to make a heatmap to show the sample stages. So, first I converted counts to logCPM.
logCPM <- cpm(counts, prior.count=2, log=TRUE)
logCPM looks like below:
Sample1 Sample2 Sample3 Sample4 Sample5
ENSG00000243485.5 0.3915381 0.3915381 0.3915381 1.0122705 0.3915381
ENSG00000237613.2 0.3915381 0.3915381 0.3915381 0.3915381 0.3915381
ENSG00000238009.6 2.7079614 2.1523650 2.3686665 3.6549466 3.5064449
ENSG00000239945.1 0.3915381 0.3915381 0.3915381 0.3915381 0.3915381
ENSG00000239906.1 0.8750013 0.3915381 0.3915381 2.9372348 0.3915381
ENSG00000241860.6 3.2731113 5.3940606 3.7562189 5.3507849 6.0161469
Then I converted logCPM to Z-scores:
Zscore <- t(scale(t(logCPM)))
Sample1 Sample2 Sample3 Sample4 Sample5
ENSG00000243485.5 -0.34352108 -0.34352108 -0.34352108 2.46775178 -0.34352108
ENSG00000237613.2 -0.07930516 -0.07930516 -0.07930516 -0.07930516 -0.07930516
ENSG00000238009.6 -0.40081457 -1.01397031 -0.77526011 0.64427767 0.48039136
ENSG00000239945.1 0.19393322 0.19393322 0.19393322 0.19393322 0.19393322
ENSG00000239906.1 0.11017422 -0.78784826 -0.78784826 3.94072835 -0.78784826
ENSG00000241860.6 -1.35704877 0.13799168 -1.01650998 0.10748697 0.57649536
I then calculated the IQR and selected the top 10% of genes for plotting. Is the above Z-score calculation right? I am a little confused. Why does the gene ENSG00000238009.6 have negative values in the first three samples and positive values in Sample4 and Sample5? For the same gene, logCPM is greater than 2 in all samples. Why is it negative as a Z-score?
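To make the question concrete, here is a minimal sketch in base R. It uses only the five logCPM values shown above for ENSG00000238009.6; the variable names (x, m, z_manual, z_scale) are made up for illustration. A row-wise Z-score is measured relative to the gene's own mean, so any sample below that mean gets a negative value even when its logCPM is well above zero. The exact Z-scores will differ from the table above if the real data set has more samples than the five shown; only the idea carries over.

```r
# logCPM values for ENSG00000238009.6 in the five samples shown above
x <- c(Sample1 = 2.7079614, Sample2 = 2.1523650, Sample3 = 2.3686665,
       Sample4 = 3.6549466, Sample5 = 3.5064449)

# Row-wise Z-score by hand: subtract the gene's mean, divide by the gene's SD
z_manual <- (x - mean(x)) / sd(x)

# The same calculation via scale() on a one-row matrix,
# mirroring t(scale(t(logCPM)))
m <- matrix(x, nrow = 1, dimnames = list("ENSG00000238009.6", names(x)))
z_scale <- t(scale(t(m)))

mean(x)         # the gene's mean logCPM across these five samples
sign(z_manual)  # -1 where the sample is below that mean, +1 where above

# Both routes give the same numbers
all.equal(unname(z_manual), as.numeric(z_scale))
```

Samples 1 to 3 fall below this gene's mean and Samples 4 and 5 fall above it, which matches the signs in the Z-score table.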
But in the edgeR tutorial I see that
t(scale(t(logCPM)))
is used to show the expression data in a heatmap. And again in this post [C: Adding column annotation to heatmap using Complexheatmap] Kevin scaled the data to Z-scores the same way.
It probably just depends on whether you want to scale columns or rows, or both. Either way you get a Z-score: one is per row, the other per column. Both normalizations make sense in some settings. Our confusion is rather about which scaling is commonly meant by "Z-score", but indeed both are.
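The difference between the two directions can be sketched on a toy matrix. The values and the names m, z_by_gene and z_by_sample are hypothetical and are not taken from the data above. scale() works on columns, so transposing before and after makes it work on rows (genes) instead.

```r
# Toy matrix (hypothetical values): genes in rows, samples in columns
m <- matrix(c( 1,  2,  3,
              10, 20, 60,
               5,  5,  8),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("gene", 1:3), paste0("Sample", 1:3)))

# Per-gene Z-score: each row ends up with mean 0 and SD 1
z_by_gene <- t(scale(t(m)))

# Per-sample Z-score: each column ends up with mean 0 and SD 1
z_by_sample <- scale(m)

round(rowMeans(z_by_gene), 10)    # zero for every gene
round(colMeans(z_by_sample), 10)  # zero for every sample
```

For an expression heatmap the per-gene version is the one used in the t(scale(t(logCPM))) call, because it puts genes with very different overall expression levels on a comparable colour scale.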
Thanks a lot for the reply.