I am new to examining RNA-seq data sets and generating heatmaps. Currently, I am working with the Mean TPM dataset from all cell types published to https://dice-database.org.
I've looked at methods for handling zero values, and I currently do the following:
- Find the smallest non-zero TPM in the entire data set.
- Pick a small number close to zero, but smaller than the smallest non-zero mean TPM.
- Replace zero values with this very small number.
- Take the log(mean TPM), and do clustering with this transformed data set.
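For concreteness, the steps above can be sketched roughly as follows. This is only an illustration on made-up numbers: the gene names, the `shrink` factor of 0.5, and the choice of log base 2 are assumptions, not part of the DICE data or of any specific pipeline.

```python
import math

# Toy mean-TPM table: gene -> mean TPM per cell type (made-up values).
mean_tpm = {
    "GENE_A": [0.0, 12.5, 3.2],
    "GENE_B": [0.4, 0.0, 150.0],
    "GENE_C": [0.0, 0.0, 0.0],
}

def log_with_zero_floor(table, shrink=0.5):
    """Replace zeros with shrink * (smallest non-zero value), then take log2.

    `shrink` is a hypothetical parameter; any factor in (0, 1) gives a
    replacement value below the smallest non-zero TPM.
    """
    nonzero = [v for row in table.values() for v in row if v > 0]
    floor = min(nonzero) * shrink
    logged = {
        gene: [math.log2(v if v > 0 else floor) for v in row]
        for gene, row in table.items()
    }
    return logged, floor

log_tpm, floor = log_with_zero_floor(mean_tpm)
```

The resulting `log_tpm` values are what would then be passed to the clustering step. Note that every zero ends up at the same, most negative value, so a gene that is zero everywhere becomes a flat row.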
What I am wondering, though, is whether I should instead set an expression threshold based on a few housekeeping genes for my cell types of interest. I ask because, for some subsets of my genes of interest, expression is at background level or absent across all of my subtypes. While that is useful to know, these genes take up space in the heatmaps! The other consideration is how to decide whether some of these smaller values represent real expression or not, to get a better understanding of dynamic range.
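To make the thresholding idea concrete, here is a minimal sketch that drops genes whose expression never reaches a cutoff in any cell type. The cutoff of 1 TPM is a hypothetical placeholder, not a recommendation; how to derive it (for example from housekeeping genes) is exactly the open question. The gene names and values are made up.

```python
# Toy mean-TPM table: gene -> mean TPM per cell type (made-up values).
mean_tpm = {
    "GENE_A": [0.0, 12.5, 3.2],
    "GENE_B": [0.4, 0.0, 0.7],
    "GENE_C": [0.0, 0.0, 0.0],
}

def keep_expressed(table, cutoff):
    """Keep genes whose highest value across cell types reaches the cutoff."""
    return {gene: row for gene, row in table.items() if max(row) >= cutoff}

CUTOFF_TPM = 1.0  # hypothetical threshold, chosen only for illustration
filtered = keep_expressed(mean_tpm, CUTOFF_TPM)
```

With these toy values only `GENE_A` is kept, so rows with background or no expression no longer take up space in the heatmap.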
Why not keep zeros as zeros? You can make heatmaps with zeros.
See my edit: I take log(TPM) to make the heatmaps.
If the problem is log(0), then add a small number. Some people do log(x+0.001), some do as high as log(x+1) to avoid having negative values. For the purpose of a heatmap, you probably won't even notice the difference.
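As a rough illustration of the two pseudocounts mentioned above, the sketch below applies both to the same made-up TPM values. The use of log base 2 and the helper name `log2_pseudocount` are assumptions for the example.

```python
import math

def log2_pseudocount(values, pseudocount):
    """Return log2(x + pseudocount) for each value."""
    return [math.log2(v + pseudocount) for v in values]

tpm = [0.0, 0.5, 10.0, 1000.0]  # made-up TPM values, sorted ascending

small = log2_pseudocount(tpm, 0.001)  # zeros become strongly negative
plus1 = log2_pseudocount(tpm, 1.0)    # zeros map to 0, nothing is negative
```

Both transforms preserve the ordering of the values. The main difference is at the low end: a small pseudocount stretches the gap between zero and low-expressed genes, while `log2(x + 1)` compresses it.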