I have RNA-Seq data with raw counts in a data frame 'counts', with samples as columns and genes as rows. It looks something like this:
Sample1 Sample2 Sample3 Sample4 Sample5
ENSG00000243485.5 0 0 0 1 0
ENSG00000237613.2 0 0 0 0 0
ENSG00000238009.6 10 5 6 16 13
ENSG00000239945.1 0 0 0 0 0
ENSG00000239906.1 1 0 0 9 0
ENSG00000241860.6 16 65 19 56 82
The sample annotation looks like this:
Samples N_stage M_stage
Sample1 N0 M1
Sample2 N1 M0
Sample3 N0 MX
Sample4 N2 M0
Sample5 N3 MX
I want to make a heatmap that shows the sample stages. First, I converted the counts to logCPM:
library(edgeR)  # provides cpm()
logCPM <- cpm(counts, prior.count=2, log=TRUE)
logCPM looks like this:
Sample1 Sample2 Sample3 Sample4 Sample5
ENSG00000243485.5 0.3915381 0.3915381 0.3915381 1.0122705 0.3915381
ENSG00000237613.2 0.3915381 0.3915381 0.3915381 0.3915381 0.3915381
ENSG00000238009.6 2.7079614 2.1523650 2.3686665 3.6549466 3.5064449
ENSG00000239945.1 0.3915381 0.3915381 0.3915381 0.3915381 0.3915381
ENSG00000239906.1 0.8750013 0.3915381 0.3915381 2.9372348 0.3915381
ENSG00000241860.6 3.2731113 5.3940606 3.7562189 5.3507849 6.0161469
Then I converted logCPM to Z-scores, gene by gene:
# scale() works on columns, so transpose, scale, and transpose back
Z_score <- t(scale(t(logCPM)))
Sample1 Sample2 Sample3 Sample4 Sample5
ENSG00000243485.5 -0.34352108 -0.34352108 -0.34352108 2.46775178 -0.34352108
ENSG00000237613.2 -0.07930516 -0.07930516 -0.07930516 -0.07930516 -0.07930516
ENSG00000238009.6 -0.40081457 -1.01397031 -0.77526011 0.64427767 0.48039136
ENSG00000239945.1 0.19393322 0.19393322 0.19393322 0.19393322 0.19393322
ENSG00000239906.1 0.11017422 -0.78784826 -0.78784826 3.94072835 -0.78784826
ENSG00000241860.6 -1.35704877 0.13799168 -1.01650998 0.10748697 0.57649536
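To make the row-wise scaling explicit, here is a minimal sketch that uses only the five logCPM values shown above for ENSG00000238009.6. The names x, z_manual, m and z_scale are illustrative, and because the sketch uses just these five samples, the numbers are not expected to match the Z-score table above. As far as I understand scale(), it subtracts the gene's own mean and divides by the gene's own standard deviation, so the sign of a Z-score only says whether a sample is below or above that gene's mean across samples, not whether logCPM itself is small.

```r
# logCPM values for ENSG00000238009.6, copied from the logCPM table above
x <- c(Sample1 = 2.7079614, Sample2 = 2.1523650, Sample3 = 2.3686665,
       Sample4 = 3.6549466, Sample5 = 3.5064449)

# Manual Z-score: subtract the gene's mean, divide by the gene's SD
z_manual <- (x - mean(x)) / sd(x)

# Same computation via scale() on a one-row matrix,
# mirroring t(scale(t(logCPM))) from the post
m <- matrix(x, nrow = 1, dimnames = list("ENSG00000238009.6", names(x)))
z_scale <- t(scale(t(m)))

print(mean(x))           # the gene's mean logCPM across these five samples
print(round(z_manual, 3))
print(all.equal(unname(z_manual), as.numeric(z_scale)))
```

In this sketch Sample1 to Sample3 lie below the gene's mean and get negative Z-scores, while Sample4 and Sample5 lie above it and get positive ones, even though every logCPM value is greater than 2.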
Next, I calculated the IQR and selected the top 10% of genes for plotting. Is the above Z-score calculation correct? I am a little confused by the result. Why does the gene ENSG00000238009.6 have negative values in the first three samples and positive values in Sample4 and Sample5? For the same gene, logCPM is greater than 2 in all samples, so why is the Z-score negative?
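For the IQR step, this is a minimal sketch of the kind of selection I mean, run on the six example genes from the logCPM table above. It assumes the IQR is taken per gene across samples on the logCPM matrix, and the names gene_iqr, n_top, top_genes and top_mat are illustrative only. With six genes, the top 10% rounds up to a single gene.

```r
# Example logCPM values copied from the table above (six genes, five samples);
# the header has one field fewer than the data rows, so read.table()
# uses the first column as row names
logCPM <- as.matrix(read.table(text = "
Sample1 Sample2 Sample3 Sample4 Sample5
ENSG00000243485.5 0.3915381 0.3915381 0.3915381 1.0122705 0.3915381
ENSG00000237613.2 0.3915381 0.3915381 0.3915381 0.3915381 0.3915381
ENSG00000238009.6 2.7079614 2.1523650 2.3686665 3.6549466 3.5064449
ENSG00000239945.1 0.3915381 0.3915381 0.3915381 0.3915381 0.3915381
ENSG00000239906.1 0.8750013 0.3915381 0.3915381 2.9372348 0.3915381
ENSG00000241860.6 3.2731113 5.3940606 3.7562189 5.3507849 6.0161469
", header = TRUE))

# Interquartile range of each gene (row) across samples
gene_iqr <- apply(logCPM, 1, IQR)

# Number of genes in the top 10%, rounded up so at least one gene is kept
n_top <- ceiling(0.10 * nrow(logCPM))

# Genes with the largest IQR, i.e. the most variable across samples
top_genes <- names(sort(gene_iqr, decreasing = TRUE))[seq_len(n_top)]

# Subset used for the heatmap; drop = FALSE keeps a matrix for a single gene
top_mat <- logCPM[top_genes, , drop = FALSE]
print(top_genes)
```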