RNAseq: Filtering before or after transformation?

22 months ago

plicht ▴ 20

I have an RNAseq dataset that I want to filter down to the 5000 most variable genes. What I want to do is:

1. estimate size factors with DESeq2::estimateSizeFactors
2. transform toward a Gaussian distribution with DESeq2::rlog
3. select the most variable genes with rowVars

Do I perform the filtering step before or after the transformation step? I tried both orders and got different results.

DESeq2 transformation RNAseq filtering • 1.3k views

ADD COMMENT • link updated 22 months ago by ATpoint 82k • written 22 months ago by plicht ▴ 20

@ATpoint: somehow your answer is not displayed under this thread, but only in my private notifications:

"You would filter for these genes after the transformation, because the whole point of the transformation is to remove the dependence of the variance on the mean (that is, on the expression level). You want to filter for "biologically variable" genes that differ between samples, not for genes with high variance due to expression level (which is technical)."

Didn't I already account for technical variation with the size factors? I thought the transformation is used to meet the Gaussian-distribution requirement of most statistical tests, not to normalize for technical biases. As such, I would expect strong agreement between the most variable genes whether they are computed from size-factor-normalized transformed counts or from size-factor-normalized untransformed counts.

ADD REPLY • link 22 months ago by plicht ▴ 20

The normalization via size factors accounts for differences in sequencing depth and library composition. The log2 transformation (or vst/rlog) is needed to remove the dependence of the variance on the mean; see the answer from @yhoogstrate and my comment.
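
To make that division of labour concrete, here is a minimal base-R sketch of median-of-ratios size factors, the approach behind DESeq2::estimateSizeFactors. The toy matrix and all variable names are made up for illustration; it is not the package's actual code. It shows that a size factor is a single scaling constant per sample, so it can correct for depth but cannot change how the variance relates to the mean across genes.

```r
# Toy count matrix (hypothetical): 4 genes x 3 samples.
# Sample s2 is sequenced exactly twice as deeply as s1 and s3.
counts <- matrix(
  c( 10,  20,  10,
    100, 200, 100,
     50, 100,  50,
      8,  16,   8),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3"))
)

# Per-gene reference: the log of the geometric mean across samples
log_geo_means <- rowMeans(log(counts))

# Size factor per sample: median ratio of its counts to the reference
size_factors <- apply(counts, 2, function(cnts) {
  keep <- is.finite(log_geo_means) & cnts > 0
  exp(median((log(cnts) - log_geo_means)[keep]))
})

# Dividing each column by its size factor removes the depth difference
norm_counts <- sweep(counts, 2, size_factors, "/")
```

In this toy case the normalized counts are identical across the three samples, because depth was the only difference between them.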

See also for the normalization itself:

ADD REPLY • link 22 months ago by ATpoint 82k

score 2 · Answer 1 · 2022-06-14

22 months ago

yhoogstrate ▴ 140

After. One of the reasons you transform is to stabilize the variance, so estimating the variance after the transformation is more reliable. There is some theory about this in the DESeq manual and in one of Simon Anders' presentations:

https://bioconductor.org/help/course-materials/2014/CSAMA2014/2_Tuesday/lectures/DESeq2-Anders.pdf
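
As a rough sketch of that order of operations (transform first, then rank genes by variance), assuming simulated data from makeExampleDESeqDataSet in place of a real dataset. The names n_top, gene_vars, top_genes and mat_top are illustrative, and n_top is set to 500 only because the simulated dataset is small; the question uses 5000.

```r
library(DESeq2)

set.seed(1)  # makeExampleDESeqDataSet() simulates random counts
dds <- makeExampleDESeqDataSet(n = 2000, m = 6)  # stand-in for real data
dds <- estimateSizeFactors(dds)

# Step 1: transform (rlog uses the size factors stored in dds)
rld <- rlog(dds, blind = TRUE)
mat <- assay(rld)

# Step 2: rank genes by their variance on the transformed scale
n_top <- 500  # hypothetical cutoff for this small example
gene_vars <- apply(mat, 1, var)
top_genes <- head(names(sort(gene_vars, decreasing = TRUE)), n_top)

# Step 3: keep only the most variable genes
mat_top <- mat[top_genes, , drop = FALSE]
```

For larger datasets vst() is usually much faster than rlog() and can be swapped in at step 1.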

ADD COMMENT • link 22 months ago by yhoogstrate ▴ 140

You can see it nicely with this simple code. Without transformation the variance rises steadily with the mean of expression (roughly linearly on the log-log plot); the transformation (here I just use log2) largely removes that dependence:

library(DESeq2)

# Simulated example data: 5000 genes
dds <- makeExampleDESeqDataSet(n=5000)
dds <- estimateSizeFactors(dds)

norm <- counts(dds, normalized=TRUE) # size-factor-normalized counts
ntd  <- log2(norm+1)                 # the same counts on the log2 scale

# Without transformation, the variance rises with the mean of expression
# (log10 axes are used here only for display)
plot(x=log10(rowMeans(norm)+1), y=log10(rowVars(norm)+1), pch=20)

# With transformation, that dependence is largely removed
plot(x=rowMeans(ntd), y=rowVars(ntd), pch=20)
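
The same point can be put into numbers rather than plots. The following base-R sketch is only an illustration under stated assumptions: counts are simulated from a negative binomial with one shared dispersion (so no gene is more "biologically variable" than another), and the names gene_means, n_top, top_raw and top_log are made up for this example. It measures how strongly the variance tracks the mean on each scale, and how many genes the two "top variable" selections share.

```r
set.seed(42)
n_genes <- 2000
n_samples <- 10

# Simulated counts (assumption): same dispersion for every gene,
# only the mean expression differs between genes
gene_means <- 2^runif(n_genes, min = 2, max = 12)
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = rep(gene_means, times = n_samples), size = 10),
  nrow = n_genes
)
log_counts <- log2(counts + 1)

# Per-gene variance on the raw and on the log2 scale
raw_var <- apply(counts, 1, var)
log_var <- apply(log_counts, 1, var)

# Rank correlation between per-gene mean and per-gene variance
cor_raw <- cor(rowMeans(counts), raw_var, method = "spearman")
cor_log <- cor(rowMeans(log_counts), log_var, method = "spearman")

# Overlap between the top variable genes picked on each scale
n_top <- 200
top_raw <- order(raw_var, decreasing = TRUE)[seq_len(n_top)]
top_log <- order(log_var, decreasing = TRUE)[seq_len(n_top)]
overlap <- length(intersect(top_raw, top_log))

print(c(cor_raw = cor_raw, cor_log = cor_log, overlap = overlap))
```

On the raw scale the selection is driven almost entirely by expression level, so the two gene lists should share only a small fraction of their members. A plain log2 with a pseudocount can still leave some trend among low-count genes, which is one reason vst or rlog is often preferred.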


ADD REPLY • link 22 months ago by ATpoint 82k

Thanks for the clarification! The PDF helped a lot.

ADD REPLY • link 22 months ago by plicht ▴ 20