Question

How important is normalization for RNA-Seq data?

3.9 years ago

sunnykevin97 ▴ 980

Hi, I would like to clarify something before proceeding to differential gene expression analysis. I assembled all the transcripts from ~100 RNA-Seq datasets, quantified them using salmon, and got quant.sf files as output. Is normalization a necessary step before differential gene expression analysis? Note that I used CD-HIT-EST to remove redundant transcripts before the quantification step.

Suggestions please!

Thanks

RNA-Seq assembly R • 1.7k views

ADD COMMENT • link updated 3.9 years ago by i.sudbery 19k • written 3.9 years ago by sunnykevin97 ▴ 980

score 4 · Answer 1 · 2020-06-16

There are three different meanings of the term "normalisation" here, and I think mixing them up might be causing confusion.

The first meaning of normalization is a process applied to remove redundancy in De Bruijn graph assembly. It is connected to estimating which transcripts are present, not in what quantity they are present. It is used in conjunction with de novo assembly tools such as Trinity. I think this is what you are referring to when you ask whether you need to normalize even though you have removed redundant transcripts. I believe such normalization is built into the tools these days, so you don't have to worry about whether to use it or not.

The second meaning, closely related to the first, is the idea of removing duplicate reads from a dataset because they might have been created by PCR duplication of the same original read. Research continues on this problem, but in general the advice is not to de-duplicate (or "normalise" in this sense) RNA-seq data.

The final meaning is the one @ATpoint refers to in their answer: making sure that the same count in two different samples means the same thing. This normalization is crucial and you absolutely must do it. It's easy to see why: if you sequenced 1 million reads in one sample and got back 100,000 reads from a gene, and sequenced 100,000 reads in another sample and got back 10,000 reads for the same gene, you wouldn't want to directly compare 100,000 to 10,000, because in both samples the gene accounts for 10% of the reads. Hence the need to normalize. Luckily, this sort of normalization is very quick and easy in modern RNA-seq analysis packages.
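To make the arithmetic in that example concrete, here is a minimal sketch in plain Python. It uses simple counts-per-million scaling, which only corrects for sequencing depth and is not what DESeq2 or edgeR do by default; the function name `counts_per_million` is made up for this illustration.

```python
def counts_per_million(gene_count, library_size):
    """Scale a raw read count to counts per million sequenced reads.

    Hypothetical helper for illustration only: depth scaling alone,
    with no correction for library composition.
    """
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return gene_count * 1_000_000 / library_size


# The two samples from the example above.
sample_a = counts_per_million(100_000, 1_000_000)  # 100,000 of 1 million reads
sample_b = counts_per_million(10_000, 100_000)     # 10,000 of 100,000 reads

# Raw counts differ tenfold, but the depth-scaled values are identical.
print(sample_a, sample_b)
```

After scaling, both samples give the same value for the gene, which is the sense in which "the same count means the same thing".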

score 2 · Answer 2 · 2020-06-16

3.9 years ago

ATpoint 81k

Yes, it is necessary, even critical, and must not be skipped. It eliminates the confounding effects of differences in library size (that is, different sequencing depths) and library composition. For more details on what that is and how normalization works in e.g. DESeq2, see

Don't worry about it. Standard workflows like DESeq2 or edgeR handle normalization internally, so there is not much you have to do. In fact, you do not even need to understand it in detail as long as you follow the default workflows. See e.g. the DESeq2 or edgeR manual.
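For the curious, here is a rough sketch of the idea behind DESeq2's default size factors (the median-of-ratios method), written in plain Python rather than R. It is only an illustration under simplifying assumptions: the function name `size_factors` and the toy counts are invented here, and the real package does more (for example, it can use transcript-length information when counts are imported from salmon).

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (simplified illustration).

    counts: dict mapping sample name -> list of raw counts,
            with genes in the same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples; genes with a zero
    # count in any sample are skipped (marked None).
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    if all(lgm is None for lgm in log_geo_means):
        raise ValueError("every gene has a zero count in at least one sample")

    # Size factor = median, over usable genes, of count / geometric mean.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - lgm
            for g, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Toy data: sample B was sequenced twice as deeply as sample A, one gene
# is strongly changed, and one gene has a zero count in A.
toy = {
    "A": [10, 20, 30, 40, 1000, 0],
    "B": [20, 40, 60, 80, 200, 5],
}
factors = size_factors(toy)
normalized = {s: [c / factors[s] for c in toy[s]] for s in toy}
print(factors)
```

Because the median is taken across genes, the one strongly changed gene does not distort the estimate; that robustness is how this approach accounts for library composition as well as sequencing depth.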