I'm analyzing count data of open reading frames (ORFs) across several metagenomic samples. I am trying to choose an appropriate normalization method for this data, and I have read that it is generally not advisable to simply divide counts by gene length and a library-size factor (i.e. to use TPM values for DESeq2/DE analysis). Why is this not advised? It would seem to allow both within- and between-sample comparisons, and I don't see any obvious drawbacks. Any help would be greatly appreciated, thank you!
As you tagged your question with rna-seq, I think you have metatranscriptomic data, right?

Not metatranscriptomic data, but rather metagenomic data. However, you are correct that my tag is inappropriate for this question, so I have changed it to "metagenomics".
Then I do not follow what you are trying to do. Why are you trying to count metagenomic reads mapping to open reading frames? One would expect coverage to be (more or less) uniform along a contig; that is, all open reading frames on a contig would have approximately the same coverage. Differences in coverage are expected between different contigs, because those contigs could have originated from different organisms present at different abundances in the sample. But these differences reflect differences in organism abundance, not gene expression.
Gene-length normalization is necessary when one wants to compare expression of different genes within a sample, and library-size normalization when one wants to compare expression of the same gene between different samples. These normalizations were developed to tackle issues arising in RNA-seq analysis.
It depends on the analysis you're doing. Normalization for gene length isn't done in edgeR or DESeq2 because, for DEG analysis, you are only comparing counts of the same gene across samples, so dividing by gene length won't change the ratios between samples for that gene. If you're doing an analysis that compares numbers between different genes rather than working at a per-gene level, then it does make sense to normalize by gene length. Measurements like FPKM and TPM normalize for both gene length and library size, and many people try to overcome some of the shortcomings of FPKM/TPM by first calculating TPMs (don't use FPKM) and then applying TMM normalization (edgeR's normalization method) to the TPMs.
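To make that concrete, here is a minimal sketch of the TPM calculation in Python (the function name and all numbers are invented for illustration; the actual tools discussed here are R packages):

```python
import numpy as np

def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize first, then depth-normalize.

    counts     -- 1D array of read counts per gene for one sample
    lengths_kb -- 1D array of gene lengths in kilobases
    """
    rpk = counts / lengths_kb      # reads per kilobase (within-sample length correction)
    return rpk / rpk.sum() * 1e6   # scale so each sample sums to one million

# Toy example: two genes with equal counts but different lengths.
counts = np.array([100.0, 100.0])
lengths_kb = np.array([1.0, 2.0])
print(tpm(counts, lengths_kb))     # the longer gene gets half the TPM
```

Note that the per-gene ratio between two samples is unaffected by the length division, which is exactly why edgeR/DESeq2 skip it for per-gene comparisons.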
The problem with standard TPM values is that they don't account for differences in library composition. As an example, let's say we have three genes (A, B, C) that are expressed at equal levels in two samples, but the second sample also has a gene D that is expressed at a high level and is not expressed at all in the first sample. You can think of each read as reaching into a bag and pulling out a transcript, weighted by its expression level. Because gene D is in sample 2's bag at a high frequency but not in sample 1's, we'll sequence fewer reads of A, B, and C from sample 2 than from sample 1, even though they're expressed at the same level in both. TMM normalization attempts to account for this.
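Here is a quick numerical sketch of that bag-sampling picture, with toy numbers of my own choosing. The final rescaling is a crude median-of-ratios stand-in, not real TMM (which lives in edgeR's calcNormFactors), but it shows the same composition correction at work:

```python
import numpy as np

rng = np.random.default_rng(0)

# True expression: A, B, C equal in both samples; D expressed only in sample 2.
expr1 = np.array([10.0, 10.0, 10.0, 0.0])   # sample 1
expr2 = np.array([10.0, 10.0, 10.0, 70.0])  # sample 2: D crowds the "bag"

depth = 1_000_000  # identical sequencing depth for both samples
reads1 = rng.multinomial(depth, expr1 / expr1.sum())
reads2 = rng.multinomial(depth, expr2 / expr2.sum())

# Naive counts-per-million says A, B, C dropped in sample 2 -- they didn't.
cpm1 = reads1 / depth * 1e6
cpm2 = reads2 / depth * 1e6
print(cpm1[:3])  # ~333333 each
print(cpm2[:3])  # ~100000 each, purely a composition artifact

# Crude correction: rescale sample 2 by the median count ratio over genes
# detected in both samples, which lines A, B, C back up.
shared = (reads1 > 0) & (reads2 > 0)
factor = np.median(reads2[shared] / reads1[shared])
print(cpm2[:3] / factor)  # back to ~333333 each
```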
Another problem with normalizations like TPM and RPKM is that they flatten out the raw read counts. Let's say you have a project where one sample has a gene with a TPM of 0.5, and another sample has a TPM of 1.0. Do those mean something like 2 reads versus 4 reads? Or more like 40 reads versus 80 reads? A difference of a single read might very well be due to sampling error, whereas a difference of 20 reads is more likely to be real. But you can't tell what's going on from the TPM alone; you've normalized away an important part of the data.
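A small worked example of that flattening, with made-up library depths and assuming equal gene lengths so that TPM reduces to reads per million:

```python
# Two hypothetical libraries where the same gene lands at the same TPM
# despite very different amounts of evidence behind it.
def simple_tpm(gene_reads, total_reads):
    return gene_reads / total_reads * 1e6

print(simple_tpm(2, 4_000_000))    # 0.5 TPM from only 2 reads
print(simple_tpm(40, 80_000_000))  # 0.5 TPM from 40 reads
```

Both calls print 0.5, so the TPM alone cannot tell you whether the underlying count is large enough to trust.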
Normalizations like TPM or RPM might be all right for visualizations, but for finding differentially expressed genes you should let DESeq2 or edgeR (or whatever you use) do its own normalization.
1) It is advised that you divide by a normalization/size factor, but this should not be a naive per-million factor that corrects only for sequencing depth; it should also take library composition into account. This video illustrates why.
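For reference, here is a simplified sketch of a composition-aware size factor in the style of DESeq2's median-of-ratios method (my own toy re-implementation with invented counts, not the package's actual code):

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Composition-aware size factors, DESeq2-style (simplified).

    counts -- genes x samples matrix of raw counts
    """
    log_counts = np.log(counts)
    # Geometric-mean pseudo-reference; genes with a zero anywhere drop out.
    log_ref = log_counts.mean(axis=1)
    usable = np.isfinite(log_ref)
    # Each sample's factor is its median gene-wise ratio to the reference.
    return np.exp(np.median(log_counts[usable] - log_ref[usable, None], axis=0))

counts = np.array([[100, 200],
                   [ 50, 100],
                   [ 10, 500]])  # last gene changes; the others just double
print(median_of_ratios_size_factors(counts))  # ~[0.71, 1.41]
```

Because the factor is a median over genes, the one gene that genuinely changed does not drag the scaling of all the others along with it, which is exactly what a naive per-million factor gets wrong.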
2) One can divide by gene length, but it gives no advantage unless you want to compare genes within the same sample. For between-sample comparisons you would shrink the counts of longer genes and thereby reduce their statistical power without adding any benefit. Between samples you compare one gene to its counterpart in another sample, so the length is the same and no correction is necessary. What is necessary is to account for which isoforms are expressed. Say that in sample A you mostly see expression of isoforms A and B, which are e.g. 20% longer than isoforms C and D, which dominate in sample B. Even if the true gene expression were the same, sample A would still (probably) get higher counts for that gene, because the longer isoforms accumulate more reads. This needs correction to avoid length bias, which can be done e.g. with the tximport package: it introduces offsets for the linear model that correct for the length bias, as in the sketch below.
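tximport itself is an R package, so the following is only a conceptual Python/statsmodels sketch of how a log-length offset enters a count model; all of the counts, lengths, and sample layout are invented:

```python
import numpy as np
import statsmodels.api as sm

# One gene, six samples, two conditions. Suppose the average effective
# transcript length differs by condition because different isoforms dominate.
counts = np.array([120, 130, 125, 100, 105, 98])             # invented raw counts
eff_length = np.array([1200, 1200, 1200, 1000, 1000, 1000])  # bp, per sample
condition = np.array([0, 0, 0, 1, 1, 1])

X = sm.add_constant(condition)

# Without an offset, the length difference masquerades as an expression change.
naive = sm.GLM(counts, X, family=sm.families.Poisson()).fit()

# With log(effective length) as an offset, counts are modeled per unit length,
# so the condition effect is no longer confounded with isoform length.
adjusted = sm.GLM(counts, X, family=sm.families.Poisson(),
                  offset=np.log(eff_length)).fit()

print(naive.params[1], adjusted.params[1])  # condition log-fold-changes
```

The naive fit reports a sizable negative log-fold-change driven purely by the shorter isoforms in the second condition; the offset fit brings it back close to zero.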