Question

Heatmap on Z scores data: scale or not?

0

3.8 years ago

camillab. ▴ 160

Hi,

I have previously converted several datasets into Z-scores (zFPKM package), and one approach I want to try for comparing the datasets is:

identifying all shared genes
subdividing those genes into categories by function (e.g. apoptotic markers, cytoskeleton) and drawing a heatmap for each subgroup (it's one approach, not the only one I am testing).

I am assuming I do not have to scale before computing the heatmap, since the data are already standardized. However, the heatmap I obtain is "monocolor", suggesting that there are no differences between genes or between samples. I looked at the Z-scores and they are not identical across conditions. If I draw the heatmap with scaling turned on (so scaling the Z-scores), I obtain a better heatmap where I can at least see some differences, but I don't think this makes sense. So my question is: do I have to scale before drawing a heatmap or running other analyses on already normalized data (Z-scores)?

thank you

heatmap RNA-Seq Zscores • 4.4k views

ADD COMMENT • link updated 3.8 years ago by Shalu Jhanwar ▴ 520 • written 3.8 years ago by camillab. ▴ 160

1

Your z-scores may not be identical across conditions, but most of them may span only a small range and so not be distinguishable by colour. Rescaling adapts the colour range to the range of values, which can make small differences more visible. Look at the distribution of your z-scores and try removing outliers. Consider this example, where all matrix values are around 0 except one:

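# 10 x 5 matrix of values jittered around 0
m <- matrix(jitter(rep(0, 50)), nrow = 10, ncol = 5)
# a single outlying value
m[4, 3] <- 1
# no rescaling: the colour range is stretched by the outlier
heatmap(m, Rowv = NA, Colv = NA, scale = 'none')
# row rescaling: small differences within each row become visible
heatmap(m, Rowv = NA, Colv = NA, scale = 'row')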

ADD REPLY • link 3.8 years ago by Jean-Karim Heriche 27k
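
To make the effect of rescaling concrete, here is a minimal sketch of what row scaling does to a matrix like the one in the example above. It assumes that scale = 'row' amounts to subtracting each row's mean and dividing by its standard deviation, which t(scale(t(m))) reproduces. The matrix and the name m_row are made up for illustration.

```r
set.seed(1)

# Hypothetical matrix: values near 0 plus one outlier
m <- matrix(rnorm(50, sd = 0.01), nrow = 10, ncol = 5)
m[4, 3] <- 1

# Row scaling: subtract each row's mean, divide by each row's standard deviation
m_row <- t(scale(t(m)))

# Before scaling, the overall range is set by the single outlier
print(range(m))

# After scaling, every row has mean 0 and standard deviation 1,
# so each row uses a comparable part of the colour range
print(range(m_row))
```

The rescaled values are relative to each row, so colours can then only be compared within a row, not across rows.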

0

Thank you very much! Can I ask a follow-up? When I run PCA I usually log transform and then set center = T, scale = F, since it's bulk RNA-seq and I do not want to lose information. Given that the ranges are similar here, should I do the same in this case, or just set scale = T and skip the log transform? Out of curiosity, I ran a PCA without log transforming, setting scale = T or F. In the scree plot, PC1 explains 23-33% of the variance when scale = T (so it does not really represent the data) and 70-90% when scale = F. So I am not sure which is the right approach because, as before, I would neither log transform nor scale.

ADD REPLY • link 3.8 years ago by camillab. ▴ 160
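
The gap between the two PC1 percentages can be reproduced on simulated data. The sketch below is only an illustration under stated assumptions: the matrix x, its dimensions and the helper pve are hypothetical, and prcomp() from base R stands in for whichever PCA function was actually used. Without scaling, a handful of high-variance genes can account for most of the total variance, so PC1 looks large; with scaling, every gene contributes equally and PC1 shrinks.

```r
set.seed(1)

# Hypothetical data: 12 samples (rows) x 200 genes (columns)
x <- matrix(rnorm(12 * 200), nrow = 12)

# Give five genes a much larger variance than the rest
x[, 1:5] <- x[, 1:5] * 10

# PCA with centring only, and with centring plus unit-variance scaling
pca_unscaled <- prcomp(x, center = TRUE, scale. = FALSE)
pca_scaled   <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
pve <- function(p) p$sdev^2 / sum(p$sdev^2)

print(round(pve(pca_unscaled)[1:3], 2))
print(round(pve(pca_scaled)[1:3], 2))
```

A high PC1 percentage without scaling therefore does not by itself mean the unscaled PCA represents the data better; it may only reflect a few genes with large variance.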

score 1 · Answer 1 · 2020-07-15

1

3.8 years ago

Shalu Jhanwar ▴ 520

I'd suggest looking at the distribution of the z-scores. You might want to plot the data within a fixed range (e.g. -3 to +3), adjusting the range to the distribution of your data. Use a diverging color map for the heatmap. You can also choose the color bar range carefully (e.g. cmap in Python and scale_colour_gradient for ggplot2 in R) to show differences more effectively.

ADD COMMENT • link 3.8 years ago by Shalu Jhanwar ▴ 520
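
As a rough sketch of this suggestion in base R: inspect the quantiles, then pass a diverging palette and breaks that are symmetric around zero to heatmap(), which forwards col and breaks to image(). The matrix z, the blue-white-red palette and the number of colours are assumptions chosen for illustration. The limit is taken from the data here so that every value falls inside the breaks.

```r
set.seed(7)

# Hypothetical z-score matrix: 20 genes (rows) x 6 samples (columns)
z <- matrix(rnorm(120, sd = 0.8), nrow = 20,
            dimnames = list(paste0("gene", 1:20), paste0("sample", 1:6)))

# Inspect the distribution before choosing a plotting range
print(quantile(z, probs = c(0, 0.01, 0.25, 0.5, 0.75, 0.99, 1)))

# Symmetric limit around zero, so that 0 maps to the middle (white) colour
lim <- ceiling(max(abs(z)))

# Diverging palette with an odd number of colours, and one more break than colours
n_col  <- 11
cols   <- colorRampPalette(c("blue", "white", "red"))(n_col)
breaks <- seq(-lim, lim, length.out = n_col + 1)

# scale = "none" keeps the z-scores as they are; only the colour mapping changes
heatmap(z, Rowv = NA, Colv = NA, scale = "none", col = cols, breaks = breaks)
```

Keeping the breaks symmetric means that positive and negative z-scores of the same size get equally strong colours.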

0

Thank you. So plotting within a range would mean excluding the genes/values outside the chosen range, right? For example, if one sample has an extremely high score for a specific gene, would I exclude it, or handle it some other way? In practice, would I have to subset my dataset to keep only the genes/variables within that range? Am I right?

ADD REPLY • link 3.8 years ago by camillab. ▴ 160

1

If you do not want to exclude any genes with extreme values, you may consider setting the values outside the range to the nearest limit of the range, just for plotting purposes. For example, if the plotting range is -3 to +3, then scores > 3 would be set to 3 and scores < -3 to -3.

ADD REPLY • link 3.8 years ago by Shalu Jhanwar ▴ 520
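
A minimal sketch of this clipping step, assuming a hypothetical z-score matrix z and a plotting limit of 3. pmax() and pmin() from base R replace values beyond the limits and leave everything else untouched; the original matrix is kept for any downstream analysis.

```r
set.seed(42)

# Hypothetical z-score matrix with one extreme value
z <- matrix(rnorm(60), nrow = 10, ncol = 6)
z[2, 4] <- 8

# Plotting limit (an assumption; choose it from the distribution of your data)
lim <- 3

# Clip for plotting only: values above lim become lim, values below -lim become -lim
z_clipped <- pmin(pmax(z, -lim), lim)

print(range(z))          # still contains the extreme value
print(range(z_clipped))  # now bounded by -lim and lim

# Plot the clipped copy; no genes or samples are removed
heatmap(z_clipped, Rowv = NA, Colv = NA, scale = "none")
```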

0

How do I select the columns (= genes) whose values across all rows (= samples) fall within a specific range? Let's say I want values between -1 and 1. I tried this, but it's not working, and so far I can only find methods for filtering specific columns, and I have too many columns:

library(dplyr)
new_frame<- Se_HC%>% filter(Se_HC %in% (-1:1))

ADD REPLY • link 3.8 years ago by camillab. ▴ 160
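
The attempt above probably fails for two reasons: filter() selects rows rather than columns, and -1:1 is just the integers -1, 0 and 1, so %in% tests for exact matches instead of a range. One possible approach, sketched in base R, is shown below. The data frame Se_HC_demo is a made-up stand-in for Se_HC, and the sketch assumes every column is numeric.

```r
# Hypothetical stand-in for Se_HC: rows = samples, columns = genes
Se_HC_demo <- data.frame(
  geneA = c(-0.5, 0.2, 0.9),
  geneB = c(-2.1, 0.3, 0.4),
  geneC = c(0.0, 1.0, -1.0),
  geneD = c(0.1, 3.5, 0.2)
)

# TRUE for each column whose values all lie within [-1, 1]
in_range <- vapply(Se_HC_demo,
                   function(col) all(col >= -1 & col <= 1),
                   logical(1))

# Keep only those columns; drop = FALSE keeps the result a data frame
new_frame <- Se_HC_demo[, in_range, drop = FALSE]
print(names(new_frame))

# With dplyr >= 1.0.0 the same selection could be written as:
# new_frame <- Se_HC_demo %>% select(where(~ all(.x >= -1 & .x <= 1)))
```

Note that this drops a gene as soon as a single sample falls outside the range, which may remove exactly the genes that differ between samples.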