Question

RNAseq Variance calculation

0

Entering edit mode

3.7 years ago

Rob ▴ 170

Hello friends I have RSEM normalized RNAseq data and I log2 transformed them to use in analysis. I want to get rid of low variance genes.

How should I calculate variance? Should I calculate Variance for RSEM data or it can be done for Log2 transformed data?

Thanks

RNA-Seq gene • 3.3k views

ADD COMMENT • link updated 3.7 years ago by ATpoint 81k • written 3.7 years ago by Rob ▴ 170

0

Entering edit mode

May I ask why you want to do that and what the final analysis goal is? I am reasonably sure there is a way of achieving your goal with standard tools so you do not need to write any custom code and leave the nitty griddy statistical details to the tool you use.

ADD REPLY • link 3.7 years ago by ATpoint 81k

0

Entering edit mode

I need to get rid of low variance genes(I calculate on excel). then I will use the remaining genes for differential expression analysis and the significant genes will be used for plotting heatmap.

For variance calculation, I m not sure if I use RSEM data or RSEM-log2 transformed data? Also, I think differential expression analysis needs to be done with normalized data(log2 transformed data). but DESeq2 or edgeR cannot do it with negative values. how can I deal with this?

ADD REPLY • link 3.7 years ago by Rob ▴ 170

1

Entering edit mode

Where have you read that you need to get rid of low variance genes? Filtering by variance is incompatible with all the empirical Bayes algorithms including limma, edgeR and DESeq2 because it introduces bias into the dispersion estimation step. Use instead the filtering recommended by those packages.

ADD REPLY • link 3.7 years ago by Gordon Smyth ★ 7.0k

score 5 · Answer 1 · 2020-08-20

I need to get rid of low variance genes(I calculate on excel). then I will use the remaining genes for differential expression analysis and the significant genes will be used for plotting heatmap.

Ah no, please do not do that, especially not in Excel. There are plenty of threads on what to do with log-normalized data. The basic rule that seems to be the current consensus is:

1) If you have raw counts use any of the established tools (DESeq2, edgeR, limma come to mind).

2) If you have normalized counts and no way to get the raw ones, use the limma trend pipeline (please read the limma manual for it).

3) Avoid any custom statistics/custom/arbitrary filtering unless you know exactly what you're doing.

4) Don't use Excel, it is not reproducible.

=> Use limma trend and please google for previous threads on that topic, biostars and support.biostars.org are full of advise on exactly that problem.

score 1 · Answer 2 · 2020-08-19

I think you should look into variance stabilizing transformations (VST) and regularized logarithm or rlog first, a great place to start is the DESeq2 vignette (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html). VST and rlog both provide log2 transformed normalized values taking into account library size and normalization factors. This transformation is essential to account for the dependence of variance on the mean. You can then calculate gene-wise variance across all the genes, sort them, and decide which ones to remove.

score 1 · Answer 3 · 2020-08-19

R has var function to calculate variance, to apply on all rows it needs to be used like: apply(data, 1, var). If you want to try different kinds of filtering, create an ExpressionSet object of your data and use nsFilter function in genefilter package. I would suggest you to try genefilter package.