Question

RNA-seq analysis: what comes first, filtering or normalization?

5.3 years ago

Herbert ▴ 10

Hi there

Please excuse my very basic question, but I was not able to find an appropriate answer using search engines.

I am trying to analyze a small RNA-seq dataset (3 vs. 3 samples) to identify differentially expressed genes and do some multivariate statistics. Because of the small sample size I chose edgeR, but I am a bit confused. The edgeR User's Guide (https://www.bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf) explains all steps nicely, but the order seems odd to me: it first describes filtering out genes with low read counts, which in my samples removes quite a bit from the respective libraries, and only then describes TMM normalization to account for RNA composition effects.

Is this really the right order to do it, or am I confusing things?

So first:

data_edgeR <- DGEList(counts = data_matrix[2:46079, 3:10], group = group)  # create DGEList for further analyses

data_edgeR$samples  # look at library sizes before filtering

# keep genes with CPM > 1 in at least 3 samples
keep <- rowSums(cpm(data_edgeR) > 1) >= 3
# subset the genes and recompute library sizes from the remaining counts
data_edgeR_filtered <- data_edgeR[keep, , keep.lib.sizes = FALSE]

and then

data_TMM_normalized <- calcNormFactors(data_edgeR_filtered)

Is this correct, or is it the other way round?

Many thanks!

R edgeR RNA-Seq • 3.7k views

score 4 · Answer 1 · 2018-12-29

5.3 years ago

h.mon 35k

Yes, that is the correct order. In general, filtering removes quite a lot of genes but only a very small percentage of the total counts, usually less than 1%. Did you compare the total read count per sample pre- and post-filtering?
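
One way to make that comparison concrete is sketched below. It uses simulated counts (hypothetical data, not the poster's) and base R only, so it runs without edgeR. The by-hand CPM is meant to mimic what `cpm()` gives before any normalization factors are set, which is an assumption of this sketch rather than edgeR's own code. The gene numbers and expression means are made up for illustration.

```r
# Simulated counts (hypothetical data): 2000 genes x 6 samples.
set.seed(1)
n_genes <- 2000
n_samples <- 6
# 1500 lowly expressed and 500 highly expressed genes (assumed means).
mu <- c(rep(0.5, 1500), rep(2000, 500))
counts <- matrix(rnbinom(n_genes * n_samples, size = 2, mu = mu),
                 nrow = n_genes, ncol = n_samples)

# Counts per million computed by hand from the raw library sizes.
lib_before <- colSums(counts)
cpm_manual <- t(t(counts) / lib_before) * 1e6

# Same rule as in the question: CPM > 1 in at least 3 samples.
keep <- rowSums(cpm_manual > 1) >= 3
lib_after <- colSums(counts[keep, , drop = FALSE])

# Share of genes kept vs. share of reads kept per sample.
pct_genes_kept <- 100 * mean(keep)
pct_reads_kept <- 100 * lib_after / lib_before

print(round(pct_genes_kept, 1))
print(round(pct_reads_kept, 2))
```

With the objects from the question, the same check would compare `colSums(data_edgeR$counts)` with `colSums(data_edgeR_filtered$counts)`.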

Thanks! Yes, I checked, but found it difficult to judge what counts as "much": from 7849976 to 7814960, for example.

5.3 years ago by Herbert ▴ 10

In the example you gave, you are keeping more than 99.5% of the original reads. This isn't "much" filtering by any means; it is just as expected.
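
The arithmetic behind that figure, as a quick sketch using the two library sizes quoted above (the variable names are only illustrative):

```r
# Library size of the example sample before and after filtering (from the thread).
reads_before <- 7849976
reads_after  <- 7814960

# Percentage of reads retained and removed by the filter.
pct_kept    <- 100 * reads_after / reads_before
pct_removed <- 100 - pct_kept

print(round(pct_kept, 2))     # 99.55
print(round(pct_removed, 2))  # 0.45
```

So the filter removes well under 1% of the reads in this sample, in line with the rule of thumb given in the answer.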