Question

Why employ normalization methods, and how can they be utilized in DEG analysis?

2

Entering edit mode

12 weeks ago

wyt1995 ▴ 30

I have learned that normalization methods such as CPM (counts per million mapped reads), TPM (transcripts per million), FPKM (fragments per kilobase of transcript per million mapped reads), and RPKM (reads per kilobase of transcript per million mapped reads) are employed to normalize the expression profile of RNA-seq data. The purpose of normalization is to facilitate the comparison of gene expression by eliminating biases. However, these methods cannot be directly used with limma, edgeR, or DESeq2, which are tools designed for the analysis of Differentially Expressed Genes (DEGs). Despite this limitation, the question arises as to why we use these normalization methods or if there is a way to incorporate them into DEG analysis using limma, edgeR, or DESeq2.

If there is any information available about the solution before posting, kindly provide the links. I would appreciate it. Thank you.

R DEGs normalization • 593 views

ADD COMMENT • link 6 weeks ago by wyt1995 ▴ 30

score 5 · Answer 1 · 2024-02-02

Firstly limma, edgeR and DESeq2 do have normalization, its just not CPM/TPM/FPKM normalisation. There are three big differences between CPM/TPM/FPKM and the normalisation methods used by edgeR and DESeq2:

TPM/FPKM normalise for gene length, and limma/edgeR/DESeq2 do not.
CPM/TPM/FPKM are (more) susceptable to composition effects.
edgeR/DESeq2 normalisation leaves counts on an integer scale, while CPM/TPM/FPKM puts things on a continuous scale.

Thus, we use TPM/FPKM when length matters, and we use edgeR/DEseq2 when being integer matters. As TPM/FPKM are very poor at dealing with compositional effects, we use edgeR/DESeq2/limma, all other things being equal.

First, very roughtly, we can start by saying that the number of reads observed for a gene g in a sample s is a product of several factors:

counts = length(g) * expression_level(g,s) * library_size(s)

We are generally interested in expression_level, but longer genes generate more counts for the same expression_level, and if you sequence 100 million reads then you will have higher counts than if you sequence 10 million reads.

When do we need integers?

Basically, we need integers any time we do any statistics that use a discrete probability distribution. The most obvious example here is differential gene expression using the negative binomial distribution, as implemented by edgeR and DESeq2. The negative binomial distribution tells us how likely it is to get X many reads from a gene given a certain expression level. I will tell us how that compares to X-1 or X+1, but it can't tell us the probability of X+0.5 or X+0.345345. And thats not just a requirement of the formula that can be overcome by rounding to the nearest integer - the numbers have to behave like they were generated by an count generating process, not just be whole numbers - this isn't the case when you divide a count by a gene length.

When do we need to correct for gene length?

If the number of reads for a gene in a sample is dependent on the length of the gene, then why do we not need to correct for gene length when doing differential gene expression? Because we are comparing two counts from the same gene. We are calculating a fold change:

fold_change = length(g) * expression_level(g,sample1) * library_size(sample1) / length(g) * expression_level(g,sample2) * library_size(sample2)`
            = expression_level(g,sample1) * library_size(sample1)/expression_level(g, sample2) * library_size(sample2)

That is, the length is present on both the top and the bottom, so cancels out.

So when do we want to correct for length? When you are comparing two different genes. If I told you that I got 1000 reads from GAPDH and 2000 reads from ACTB, wouldn't know if that was because GAPDH was shorter, or because it was less highliy expressed.

Compositional effects?

Compositional effects arise because the number of reads is not actaully proportional the expression level of a gene, but the fraction of transcription that comes from a gene. if you have two genes in a sample, and say one is expressed at 499,000 units and the other at 1,000 and you sequence 1,000,000 reads, then you'll get 998,000 reads from gene 1 and 2,000 reads from gene 2. Now if you double the expression level of gene 1 to 998,000, but leave gene 2 where it is, you'll get 999,000 reads from gene 1, and only 1,000 reads from gene 2: doubling the expression of gene 1 has made it look like the expression of gene 2 has halved.

FPKM/CPM/TPM just divide by the total number of reads, and so are very suceptable to this. edgeR, DESeq2 and limma use more sophisticated methods for normalising for library size. They still suffer from this a bit, and it often introduces other assumptions, but they generally mitigate at least some of this effect.

There are plenty of papers about normalisation of RNAseq data, but you could probably do worse than starting with the original DESeq and edgeR papers.