Using ColSums vs sizeFactors in read count normalization
3.4 years ago
tpaboh • 0

Hello,

I have an RNA-seq data set with 10 samples. I noticed that I get slightly different FPKM values depending on whether I use colSums or sizeFactors for read count normalization. (See the following figure.)

My question is: how do I figure out which library size measure to use for normalization? Does it come down to personal preference?

Thank you!

RNA-Seq R DESeq2 fpkm • 1.6k views
3.4 years ago

There are already many questions addressing this issue. Search for the terms "median of ratios method" (the normalization used to calculate the size factors in DESeq, also called RLE) or "between-sample normalization".

In short, the median of ratios is a more robust normalization metric. In contrast, metrics based on the total read count (colSums, as you said) are very sensitive to highly expressed genes: a few such genes can make up a large share of a sample's total, which skews the normalization for all the other genes. This review, for instance, illustrates the issue nicely.
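
To make the difference concrete, here is a minimal sketch of both approaches in plain Python. It is not DESeq2's code (in R you would simply call estimateSizeFactors and sizeFactors); the function names and the toy count matrix are made up for illustration. The two samples are identical except for one gene that is very highly expressed in the second sample.

```python
import math
from statistics import median


def total_count_factors(counts):
    """Per-sample factors proportional to column sums (the colSums approach)."""
    n_samples = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    mean_total = sum(totals) / n_samples
    return [t / mean_total for t in totals]


def median_of_ratios_factors(counts):
    """Per-sample factors following the median-of-ratios idea (sketch only)."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue  # a gene with a zero count has no finite geometric mean
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Size factor = median ratio of a sample's counts to the per-gene geometric mean
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical toy data: rows are genes, columns are two samples.
toy_counts = [
    [100, 100],
    [200, 200],
    [50, 50],
    [400, 400],
    [250, 10250],  # one gene dominates the second sample
]

print("total-count factors:     ", total_count_factors(toy_counts))
print("median-of-ratios factors:", median_of_ratios_factors(toy_counts))
```

In this toy case the column sums are 1000 and 11000, so total-count scaling treats the second sample as 11 times deeper and would push down the four unchanged genes in it, even though their raw counts are identical in both samples. The median of ratios ignores the single outlier gene and gives both samples the same size factor.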

3.4 years ago

For finding DE genes, you should follow the regular DESeq2 protocol and not use FPKM.

For visualization purposes, 'correct' normalization is a little less important. Learn how each normalization works, and decide which method's assumptions fit your data better.

3.4 years ago
ATpoint 70k

What you refer to is naive per-million normalization vs RLE from DESeq2. Please read the DESeq paper, which discusses why this technique exists and why it outperforms naive methods. Also, search PubMed for papers benchmarking normalization methods. They will all show that per-million is inferior. For a quick introduction see: