Why you shouldn't normalize count data using gene-length AND library size for count analysis?
3
1
Entering edit mode
17 months ago

I'm analyzing metagenomic count data of open reading frames for several metagenomic samples. I am trying to think of an appropriate normalization method for this data, and have read that it is generally not advisable to simply divide counts by gene-length and a library size factor (i.e. using TPM counts for DESeq2/DE analysis). Why is this not advised? It would seem to allow for within and between sample comparisons and I don't see any intuitive drawbacks. Any help would be greatly appreciated, thank you!

tpm normalization counts DESeq2 metagenomics • 2.0k views
ADD COMMENT
0
Entering edit mode

metagenomic count data of open reading frames for several metagenomic samples.

As you tagged your question with rna-seq, I think you have metatranscriptomic data, right?

ADD REPLY
0
Entering edit mode

not metatranscriptomic data, but rather metagenomic data. However you are correct that my tag is inappropriate for this question so I have changed it to "metagenomics"

ADD REPLY
0
Entering edit mode

Then I do not follow what you are trying to do. Why are trying to count metagenomic reads mapping to open reading frames? One would expect coverage to be (more or less) uniform along a contig, that is, all open reading frames over a contig would have the same approximate coverage. Differences in coverage are expected between different contigs, because these contigs could have originated from different organisms that are present at different abundances in the sample. But these differences reflect differences in organism abundance, not gene expression.

Gene length normalization is necessary when one wants to compare expression of different genes within a sample, and library size normalization, when one wants to compare expression of same gene between different samples. These normalizations were developed to tackle issues arising from RNAseq analysis.

ADD REPLY
1
Entering edit mode
17 months ago
colin.kern ▴ 970

It depends on the analysis you're doing. Normalization for gene length isn't done in edgeR or DESeq2 because for DEG analysis you are only comparing counts of the same gene across samples, so normalizing by gene length won't change the proportions of the numbers of the same gene across samples. If you're doing an analysis that will not be looking on a per-gene level and attempting to compare numbers between different genes, then it makes sense to normalize by gene length. Measurements like FPKM and TPM do normalize for both gene-length and library size, and many people try to overcome some of the shortcomings of FPKM/TPM by first calculating TPMs (don't use FPKM) and then applying TMM normalization (edgeR's normalization method) on the TPMs.

The problem with standard TPM values is that it doesn't account for differences in library composition. As an example, let's say we have 3 genes (A, B, C) that are expressed at equal levels in two samples, but the second sample also has a gene D that is expressed at a high level, but not expressed in the first sample. You can think of each read as reaching into a bag and pulling out a transcript, weighted by its expression level. Because gene D is in sample 2's bag with a high frequency, but not sample 1's, then we'll sequence fewer reads of A, B, and C from sample 2 compared to sample 1 even though they're expressed at the same level in both. TMM normalization attempts to account for this.

ADD COMMENT
1
Entering edit mode
17 months ago
swbarnes2 9.7k

Another problem with normalizations like TPM and RPKM is they will flatten out the raw read counts. Let's say you have a project where one sample has a gene with TPM of 0.5, and another sample has TPM of 1.0 Do those mean something like 2 reads versus 4 reads? Or more like 40 reads versus 80 reads? A difference of a single read might very well be due to sampling error, a difference of 20 reads is more likely to be real. But you can't tell what's going on from the TPM alone; you've normalized away an important part of the data.

Normalizations like TPM or RPM might be alright for visualzations, but for finding differentially expressed genes, you should let DESeq2 or edgeR or whatever do their own normalization

ADD COMMENT
1
Entering edit mode
17 months ago
ATpoint 47k

There are two things:

1) It is advised that you divide by a normalization / size factor but this should not be a naive per-million factor that only corrects for sequencing depth but should also take library composition into account. This video illustrates why.

2) One can divide by size but it does not give any advantage unless you want to compare genes within the same sample. For between-sample comparison you would reduce the counts of longer genes and by this reduce their statistical power without adding any benefits. Between samples you compare one gene to its counterpart in a second sample so the total length is the same, therefore no correction necessary. What is necessary is to respect which isoforms are expressed. Say in sample A you have mostly expression of isoform A and B which are e.g. 20% longer than isoforms C and D which are mostly expressed in sample B => say the true gene expression was the same => sample A would still (probably) get higher counts for that gene as the isoforms are longer and therefore accumulate more reads. This needs correction to avoid length bias, which can e.g. be done with the tximport package. It introduces offsets for the linear model to correct for that length bias.

ADD COMMENT

Login before adding your answer.

Traffic: 1658 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6