Question: RNAseq analysis: what comes first, filtering or normalization?
7 months ago
Herbert0 wrote:

Hi there

please excuse my very basic questions, but I was not able to find appropriate answers using search engines.

I am trying to analyze a small RNAseq dataset of 3 vs 3 samples to identify differentially expressed genes and do some multivariate statistics. Because of the small sample size I chose edgeR, but I am a bit confused. In the package description all steps are nicely explained, but the order seems odd to me: they first describe filtering out genes with low read counts, which in my samples removes quite a bit from the respective libraries, and only then describe TMM normalization to account for the RNA composition effect.

Is this really the right order to do it, or am I confusing things?

So first:

data_edgeR <- DGEList(counts=data_matrix[2:46079,3:10], group=group) #create DGEList for further analyses

data_edgeR$samples #looking at library sizes before filtering

keep <- rowSums(cpm(data_edgeR)>1) >= 3
data_edgeR_filtered <- data_edgeR[keep, , keep.lib.sizes=FALSE]

and then

data_TMM_normalized <- calcNormFactors(data_edgeR_filtered)

Is this correct, or is it the other way around?

Many thanks!

modified 7 months ago • written 7 months ago by Herbert0
7 months ago
h.mon27k wrote:

Yes, it is the correct order. In general, the filtering removes quite a lot of genes, but only a very small percentage of the total counts, usually less than 1%. Did you compare the total read count per sample pre- and post-filtering?
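To make the "many genes, few counts" point concrete, here is a minimal base-R sketch on simulated data. Everything in it is an assumption for illustration: the count matrix is random, the variable names are made up, and the CPM is computed by hand (counts divided by library size, times one million) rather than with edgeR's cpm(), so it only mimics the filter from the question.

```r
# Toy illustration in base R (no edgeR needed); all counts are simulated.
set.seed(1)
n_genes <- 2000
n_samples <- 6

# Hypothetical expression levels: many lowly expressed genes, a few high ones.
mean_expr <- rlnorm(n_genes, meanlog = 1, sdlog = 2.5)
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = rep(mean_expr, n_samples), size = 2),
  nrow = n_genes, ncol = n_samples
)

# Library sizes before filtering.
lib_size_before <- colSums(counts)

# Counts per million, computed by hand: divide each column by its library size.
cpm_manual <- t(t(counts) / lib_size_before) * 1e6

# Same rule as in the question: CPM > 1 in at least 3 samples.
keep <- rowSums(cpm_manual > 1) >= 3
counts_filtered <- counts[keep, , drop = FALSE]

# Library sizes after filtering, and the fractions that survive.
lib_size_after <- colSums(counts_filtered)
fraction_genes_kept <- mean(keep)
fraction_reads_kept <- lib_size_after / lib_size_before

print(fraction_genes_kept)
print(round(fraction_reads_kept, 4))
```

On real data the same comparison can be made by looking at the lib.size column of the samples table before and after subsetting the DGEList with keep.lib.sizes=FALSE, as in the question.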

written 7 months ago by h.mon27k

Thanks! Yes, I checked, but found it difficult to judge what counts as "much": from 7849976 to 7814960, for example.

written 7 months ago by Herbert0

In the example you gave you are keeping more than 99.5% of the original reads - this isn't "much" filtering by any means, and it is just as expected.
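For reference, the arithmetic behind that figure can be checked in a couple of lines of R, using only the two library sizes quoted above (the variable names are just for illustration):

```r
# Library size of the example sample before and after filtering (from the thread).
before <- 7849976
after  <- 7814960

# Reads removed by the filter, and the percentage of reads retained.
removed <- before - after
pct_retained <- 100 * after / before

print(removed)
print(round(pct_retained, 2))
```

So the filter drops about 35 thousand reads out of roughly 7.85 million, which is well under the 1% mentioned earlier.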

written 7 months ago by h.mon27k

Perfect, thank you very much!

written 7 months ago by Herbert0
