Hello all, I have been thinking about RNA-seq heatmaps for a while now and would appreciate feedback from others. I am working with non-model organisms from messy microbiome datasets that don't work well with tools like DESeq2/edgeR. It's difficult to determine what is "differentially expressed" in this case because I am not running formal statistical tests. Rather, I am looking for qualitative differences that are consistent across many samples (sort of "replicates"), and for this, clustered heatmaps are helpful. I have been normalizing the data by library size and plotting after log2 transformation.
I noticed that sorting by variance on log2-transformed data identifies weakly/moderately expressed genes that are highly variable across samples, whereas sorting by variance on the non-log2-transformed data surfaces genes with higher baseline expression that are less variable (but still variable). I believe these latter genes are missed otherwise because the log transformation compresses differences between large values, so their variance on the log scale comes out small (see the useful write-up by Friederike Dündar here: https://github.com/friedue/Notes/blob/master/RNA_heteroskedasticity.md).
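To make the comparison concrete, here is a minimal toy sketch in Python/NumPy (not my actual pipeline). The four genes and their values are made up for illustration and are assumed to be already library-normalized; the pseudocount of 1 before log2 and the helper name `rank_by_variance` are also just assumptions for the example.

```python
import numpy as np

# Hypothetical, already library-normalized expression values (genes x samples).
# These numbers are invented purely to illustrate the ranking behaviour.
gene_names = ["low_big_fold", "high_moderate_fold", "high_flat", "low_flat"]
raw = np.array([
    [4, 64, 4, 64, 4, 64],                      # weakly expressed, 16-fold swings
    [8000, 12000, 8000, 12000, 8000, 12000],    # abundant, 1.5-fold swings
    [10000, 10100, 9900, 10000, 10100, 9900],   # abundant, essentially flat
    [10, 11, 9, 10, 11, 9],                     # weakly expressed, flat
], dtype=float)

# log2 with a pseudocount of 1 (assumed here; other pseudocounts are common too).
logged = np.log2(raw + 1.0)


def rank_by_variance(matrix):
    """Return gene (row) indices ordered from highest to lowest sample variance."""
    variances = matrix.var(axis=1, ddof=1)  # per-gene variance across samples
    return np.argsort(variances)[::-1]


raw_order = rank_by_variance(raw)
log_order = rank_by_variance(logged)

print("ranked on raw scale: ", [gene_names[i] for i in raw_order])
print("ranked on log2 scale:", [gene_names[i] for i in log_order])
```

In this toy case the log2 ranking puts the weakly expressed gene with large fold changes first, while the raw-scale ranking puts the abundant gene with moderate fold changes first. Note that on the raw scale the abundant but essentially flat gene also outranks the weakly expressed, strongly varying one, simply because its absolute fluctuations are larger.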
I can't find much on the discussion boards or in tutorials where people have actually used non-log-transformed data for the purpose of measuring variance. What are your thoughts on presenting data this way, if it provides biologically interesting results? I think the downside might be that you end up interpreting genes that are not really variable, but merely abundant, though I think your heatmap would tell you if that was the case (i.e., no clustering across samples, just noise).