Can it be useful to depict variable genes using non-log transformed counts?
19 months ago
N15 ▴ 160

Hello all, I have been thinking about RNA-seq heatmaps for a while now and would appreciate feedback from others. I am working with non-model organisms from messy microbiome datasets that don't work well with tools like DESeq2/edgeR. It's difficult to determine what is "differentially expressed" in this case because I am not running actual statistical tests. Rather, I am looking for qualitative differences that are consistent across many samples (sort of "replicates"), and for this, clustered heatmaps are helpful. I have been library-size-normalizing the data and plotting after log2 transformation.

I noticed that sorting by variance on log2-transformed data identifies weakly/moderately expressed genes that are highly variable across samples, whereas sorting on the non-log2-transformed data shows less variable (but still variable) genes with higher baseline expression. I believe these latter genes are missed otherwise because calculating variance on log-transformed large numbers yields small variances (see the useful write-up by Friederike Dündar here: https://github.com/friedue/Notes/blob/master/RNA_heteroskedasticity.md).
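To make the effect concrete, here is a minimal sketch of ranking genes by variance on the raw scale versus the log2 scale. The gene names, the values, and the `rank_by_variance` helper are made up for illustration, and a pseudocount of 1 is assumed for the log transform; this is not taken from any real dataset.

```python
import math
from statistics import pvariance

# Hypothetical library-normalized expression values (one list per gene,
# one value per sample). These numbers are invented for illustration.
expr = {
    "low_variable":  [1, 8, 2, 16, 1, 8],
    "high_variable": [1000, 1400, 900, 1500, 1100, 1300],
    "high_stable":   [2000, 2010, 1990, 2005, 1995, 2000],
}

def rank_by_variance(matrix, transform=lambda x: x):
    """Return gene names sorted from most to least variable after `transform`."""
    variances = {
        gene: pvariance([transform(v) for v in values])
        for gene, values in matrix.items()
    }
    return sorted(variances, key=variances.get, reverse=True)

# Rank on the raw scale, then on the log2 scale (pseudocount of 1 assumed).
raw_rank = rank_by_variance(expr)
log_rank = rank_by_variance(expr, transform=lambda x: math.log2(x + 1))

print("raw scale :", raw_rank)
print("log2 scale:", log_rank)
```

With these toy numbers, the abundant gene with large absolute swings ranks first on the raw scale, while the weakly expressed gene with large fold changes ranks first on the log2 scale. The nearly flat but abundant gene also outranks the weakly expressed one on the raw scale, which is the "merely abundant" risk.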

I can't find much on the discussion boards or in tutorials where people have actually used non-log-transformed data for the purpose of measuring variance. What are your thoughts on presenting data this way, if it provides biologically interesting results? I think the downside might be that you are interpreting genes that are not really variable, but merely abundant, though I think your heatmap would tell you if that was the case (i.e., no clustering across samples, just noise).

variance rnaseq heatmap • 971 views
19 months ago
biomon ▴ 60

What do you mean by qualitative differences? So if I understand this right, you have many samples and you want to see what is consistent across all and find what can vary?

So I am assuming you are using calcNormFactors() and doing an FPKM/RPKM or a CPM conversion?

I would do the library size normalisation and the log2 transformation. Then I would try these two approaches before looking at non-log data, and go from there.

1) K-means clustering (kmeans()): you need to specify the number of clusters, so play around with this, and you need to standardise the expression values too.

2) Perform a PCA, look at the top variable genes in PC1 and PC2, and visualise them to see if there is an interesting distribution (e.g., with the factoextra library).

These two approaches should give you similar results.
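The standardisation step mentioned for k-means can be sketched as a per-gene z-score. This is only an illustration in plain Python with made-up values; `standardise_rows` is a hypothetical helper, and in R the same idea is commonly written with `scale()` on the transposed matrix.

```python
from statistics import mean, pstdev

def standardise_rows(matrix):
    """Z-score each gene (row): subtract its mean and divide by its SD."""
    scaled = {}
    for gene, values in matrix.items():
        mu, sd = mean(values), pstdev(values)
        # A gene with zero variance carries no pattern; map it to all zeros.
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled

# Hypothetical log2 expression values: two genes with the same pattern
# across samples but very different expression levels.
log_expr = {
    "gene_a": [2.0, 4.0, 2.0, 4.0],
    "gene_b": [10.0, 12.0, 10.0, 12.0],
}

z = standardise_rows(log_expr)
print(z)
```

After standardisation the two genes have identical profiles, so a distance-based method such as k-means groups them by pattern across samples rather than by expression level.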

18 months ago

High variance for lowly expressed genes and low variance for highly expressed genes, as seen after log transformation, is typical for count data following Poisson/negative binomial distributions. It's called heteroscedasticity. If you use proper statistical methods that are based on these distributions, they will take that into account. Otherwise you have to "moderate" the variance for lowly expressed genes. You can use VSN, or in many cases a log10(a + counts) transform, where a is some constant. Often a=1 is used, but a=10 or a=64 will stabilize your SD much more.
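As a rough illustration of how the constant a affects the spread of lowly expressed genes, the sketch below applies log10(a + counts) to two made-up genes and compares their standard deviations. The counts and the `shifted_log10` helper are hypothetical; only the transform itself comes from the text above.

```python
import math
from statistics import pstdev

def shifted_log10(counts, a=1):
    """Apply the log10(a + counts) transform to a list of counts."""
    return [math.log10(a + c) for c in counts]

# Hypothetical counts: a weakly expressed gene and a highly expressed gene.
low_gene = [0, 3, 1, 6, 2, 0]
high_gene = [900, 1100, 1000, 1200, 950, 1050]

# Map each value of a to (SD of low gene, SD of high gene) after the transform.
sd_by_a = {}
for a in (1, 10, 64):
    sd_low = pstdev(shifted_log10(low_gene, a))
    sd_high = pstdev(shifted_log10(high_gene, a))
    sd_by_a[a] = (sd_low, sd_high)
    print(f"a={a:>2}  SD low={sd_low:.3f}  SD high={sd_high:.3f}")
```

In this toy example the SD of the weakly expressed gene shrinks sharply as a grows, while the SD of the highly expressed gene barely changes. A large a can also push the low gene's SD below that of the high gene, so the choice of a is a trade-off to check on your own data.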


Thank you for the comment. Do you know what tools I can use to normalize for this? As I understand it, the variance stabilizing transformation (vst) within DESeq2 might help, but I am not using DESeq2 for this analysis. Is there any harm in showing both log-normalized and non-log-normalized heatmaps to account for either bias?


Please tell me a bit more about what you are using. If you use DESeq2 or edgeR, you do not need to normalize separately because their methods account for this. There is no physical harm in making both heatmaps, log-normalized and not log-normalized; just do not show the "not log-normalized" one to me, you can show it to your mother. Ivo.


I am not using either of those tools. I am normalizing using RPKM, which accounts for differences in library size across samples. Why do you think the "not log-normalized" variance-sorted approach is worthless, if log2 transforming is understood to bias against abundant genes? The heatmap would still be showing log2-transformed data... just highlighting a mostly different (but partly overlapping) subset of genes as the "top X variable".


Hello Ivo, any other thoughts or recommendations would be appreciated. Thank you very much for your time.