Can it be useful to depict variable genes using non-log transformed counts?
20 days ago
N15 ▴ 150

Hello all, I have been thinking about RNA-seq heatmaps for a while now and would appreciate feedback from others. I am working with non-model organisms from messy microbiome datasets that don't work well with tools like DESeq2/edgeR. It's difficult to determine what is "differentially expressed" in this case because I am not running actual statistical tests. Rather, I am looking for qualitative differences that are consistent across many samples (sort of "replicates"), and for this, clustered heatmaps are helpful. I have been library-size-normalizing the data and plotting after log2 transformation.
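For readers following along, here is a minimal sketch of what "library-size normalization followed by log2 transformation" can look like, written in plain Python rather than with the R packages named in the thread. The counts, the gene names, and the pseudocount of 1 are made up for illustration; the thread does not state the exact normalization that was used.

```python
import math

def cpm(counts):
    """Scale raw counts to counts per million within each sample.

    counts: dict mapping gene name -> list of raw counts (one per sample).
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total counts observed in each sample.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
        for gene, row in counts.items()
    }

def log2_cpm(counts, pseudocount=1.0):
    """log2(CPM + pseudocount); the pseudocount (an assumption) avoids log2(0)."""
    return {
        gene: [math.log2(v + pseudocount) for v in row]
        for gene, row in cpm(counts).items()
    }

# Hypothetical raw counts: 2 genes x 3 samples with different sequencing depths.
raw_counts = {
    "geneA": [10, 20, 5],
    "geneB": [990, 1980, 495],
}

normalized = cpm(raw_counts)
logged = log2_cpm(raw_counts)
```

In this toy example the three samples differ only in sequencing depth, so after CPM scaling each gene has the same value in every sample.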

I noticed that sorting by variance on log2-transformed data identifies weakly/moderately expressed genes that are highly variable across samples, and sorting on the non-log2-transformed data will show less variable (but still variable) genes with higher baseline expression. I believe these latter genes are missed otherwise because calculating variance on log-transformed large numbers yields small variances (see useful write-up by Friederike Dündar here:

I can't find much on the discussion boards or in tutorials where people have actually used non-log-transformed data for the purpose of measuring variance. What are your thoughts on presenting data this way, if it provides biologically interesting results? I think the downside might be that you are interpreting genes that are not really variable, but merely abundant, though I think your heatmap would tell you if that was the case (i.e., no clustering across samples, just noise).
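The ranking effect described in this post can be reproduced with a small, made-up example. The sketch below assumes already-normalized values for three hypothetical genes and ranks them by sample variance, once on the raw scale and once after log2(x + 1). The gene names and numbers are invented for illustration only.

```python
import math
from statistics import variance

# Hypothetical normalized expression (e.g. CPM) across six samples.
cpm_values = {
    "low_variable":  [1, 30, 2, 40, 1, 35],                  # weakly expressed, large fold changes
    "high_moderate": [4000, 6000, 4200, 6100, 3900, 5900],   # abundant, ~1.5-fold changes
    "high_flat":     [5000, 5050, 4950, 5020, 4980, 5010],   # abundant, essentially constant
}

def rank_by_variance(expr, transform=None):
    """Return gene names sorted from highest to lowest sample variance."""
    scored = {}
    for gene, values in expr.items():
        vals = [transform(v) for v in values] if transform else list(values)
        scored[gene] = variance(vals)
    return sorted(scored, key=scored.get, reverse=True)

raw_ranking = rank_by_variance(cpm_values)
log_ranking = rank_by_variance(cpm_values, transform=lambda v: math.log2(v + 1))

print("raw scale :", raw_ranking)
print("log2 scale:", log_ranking)
```

On the raw scale the abundant gene with moderate changes ranks first, and even the flat abundant gene outranks the weakly expressed one, which is the "merely abundant" risk raised above. On the log2 scale the weakly expressed gene with large fold changes ranks first.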

variance rnaseq heatmap • 276 views
15 days ago
biomon ▴ 60

What do you mean by qualitative differences? So if I understand this right, you have many samples and you want to see what is consistent across all and find what can vary?

So I am assuming you are using calcNormFactors (edgeR) and doing an FPKM/RPKM or a CPM conversion?

I would do the library size normalisation and the log2 transformation. Then I would try these two approaches before looking at non-log data, and go from there.

1) K-means clustering (kmeans()). You need to specify the number of clusters, so play around with this, and you need to standardise the expression too.

2) Perform a PCA, look at the top variable genes in PC1 and PC2, and visualise them to see if there is an interesting distribution (e.g. with the factoextra library).

These two approaches should give you similar results.
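To illustrate why standardising matters for approach 1, here is a minimal sketch in plain Python, not R's kmeans(). It z-scores each gene across samples and runs a bare-bones k-means with a deterministic start (the first k genes as initial centres, which is an assumption made so the example is reproducible; R's kmeans() uses random starts). The four genes and their log2 values are made up.

```python
import math
from statistics import mean, pstdev

def zscore(row):
    """Centre a gene to mean 0 and scale to SD 1 across samples."""
    m, s = mean(row), pstdev(row)
    return [(v - m) / s if s > 0 else 0.0 for v in row]

def kmeans(points, k, n_iter=20):
    """Very small k-means; initial centres are the first k points (assumption)."""
    centres = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centre by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points]
        # Update step: each centre becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = [mean(col) for col in zip(*members)]
    return labels

# Hypothetical log2 expression; samples 1-3 and 4-6 are two conditions.
log_expr = {
    "up_low":    [2.0, 2.2, 1.9, 5.0, 5.1, 4.8],
    "down_low":  [6.0, 5.8, 6.1, 3.0, 3.2, 2.9],
    "up_high":   [10.0, 10.3, 9.8, 13.1, 13.0, 12.7],
    "down_high": [12.0, 11.9, 12.2, 9.1, 9.0, 8.8],
}
genes = list(log_expr)

raw_labels = kmeans([log_expr[g] for g in genes], k=2)
std_labels = kmeans([zscore(log_expr[g]) for g in genes], k=2)

print(dict(zip(genes, raw_labels)))  # grouped by expression level
print(dict(zip(genes, std_labels)))  # grouped by up/down pattern
```

Without standardisation the clusters in this toy data separate low from high expression; after z-scoring they separate the "up" genes from the "down" genes, which is usually the pattern of interest in a heatmap.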

5 days ago

High variance for lowly expressed genes and low variance for highly expressed genes (on the log scale) is typical for Poisson/negative binomial distributed counts. This dependence of the variance on the mean is called heteroscedasticity. If you use proper statistical methods that are based on these distributions, they will take that into account. Otherwise you have to "moderate" the variance for lowly expressed genes. You can use VSN or, in many cases, a log10(a + counts) transform, where a is some constant. Often a=1 is used, but a=10 or a=64 will stabilize your SD much more.
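The effect of the constant a can be checked numerically. The sketch below makes a simplifying assumption that counts are pure Poisson (no extra biological dispersion, so it is not a negative binomial model) and computes the exact SD of log10(a + X) for several mean expression levels by summing over the Poisson probabilities. The function names and the grid of means are chosen here for illustration.

```python
import math

def poisson_pmf(k, mu):
    """P(X = k) for X ~ Poisson(mu), computed in log space for stability."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def sd_log10_shifted(mu, a):
    """SD of log10(a + X) for X ~ Poisson(mu), by direct summation."""
    upper = int(mu + 12 * math.sqrt(mu) + 30)  # far enough into the tail
    first = second = 0.0
    for k in range(upper + 1):
        p = poisson_pmf(k, mu)
        v = math.log10(a + k)
        first += p * v
        second += p * v * v
    return math.sqrt(max(second - first * first, 0.0))

means = [1, 10, 100, 1000]   # mean counts, from weakly to highly expressed
offsets = [1, 10, 64]        # the constants a mentioned above

sd_table = {a: [sd_log10_shifted(mu, a) for mu in means] for a in offsets}
# Ratio of largest to smallest SD across expression levels:
# closer to 1 means the variance is better stabilized.
spread = {a: max(sds) / min(sds) for a, sds in sd_table.items()}

for a in offsets:
    print(f"a={a:>2}: SDs={[round(s, 4) for s in sd_table[a]]}, max/min={spread[a]:.1f}")
```

Under this Poisson assumption, the SD after the transform varies much less across expression levels for a=10 or a=64 than for a=1. Real RNA-seq counts are overdispersed, so the numbers will differ, but the direction of the effect is the same.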


Thank you for the comment. Do you know what tools I can use to normalize for this? As I understand it, the variance-stabilizing transformation (vst) within DESeq2 might help, but I am not using DESeq2 for this analysis. Is there any harm in showing both log-normalized and non-log-normalized heatmaps to account for either bias?

