I'm analyzing count data of open reading frames (ORFs) across several metagenomic samples. I am trying to choose an appropriate normalization method for this data, and I have read that it is generally not advisable to simply divide counts by gene length and a library-size factor (i.e. to use TPM values for DESeq2/DE analysis). Why is this not advised? It would seem to allow both within- and between-sample comparisons, and I don't see any obvious drawbacks. Any help would be greatly appreciated, thank you!
As you tagged your question with rna-seq, I think you have metatranscriptomic data, right?

Not metatranscriptomic data, but rather metagenomic data. However, you are correct that my tag is inappropriate for this question, so I have changed it to "metagenomics".
Then I do not follow what you are trying to do. Why are you trying to count metagenomic reads mapping to open reading frames? One would expect coverage to be (more or less) uniform along a contig; that is, all open reading frames on a contig would have approximately the same coverage. Differences in coverage are expected between different contigs, because those contigs could have originated from different organisms present at different abundances in the sample. But these differences reflect differences in organism abundance, not gene expression.
Gene-length normalization is necessary when one wants to compare expression of different genes within a sample, and library-size normalization when one wants to compare expression of the same gene between different samples. These normalizations were developed to tackle issues arising in RNA-seq analysis.
It depends on the analysis you're doing. Normalization for gene length isn't done in edgeR or DESeq2 because, for DEG analysis, you are only comparing counts of the same gene across samples, so dividing by gene length won't change the ratios between samples for that gene. If you're doing an analysis that compares numbers between different genes rather than working at a per-gene level, then it does make sense to normalize by gene length. Measurements like FPKM and TPM normalize for both gene length and library size, and many people try to overcome some of the shortcomings of FPKM/TPM by first calculating TPMs (don't use FPKM) and then applying TMM normalization (edgeR's normalization method) to the TPMs.
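To make that concrete, here is a minimal sketch of the TPM calculation in Python (the function name and all numbers are invented for illustration; the actual tools discussed here are R packages):

```python
import numpy as np

def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize first, then depth-normalize.

    counts     -- 1D array of read counts per gene for one sample
    lengths_kb -- 1D array of gene lengths in kilobases
    """
    rpk = counts / lengths_kb      # reads per kilobase (within-sample length correction)
    return rpk / rpk.sum() * 1e6   # scale so each sample sums to one million

# Toy example: two genes with equal counts but different lengths.
counts = np.array([100.0, 100.0])
lengths_kb = np.array([1.0, 2.0])
print(tpm(counts, lengths_kb))     # the longer gene gets half the TPM
```

Note that the per-gene ratio between two samples is unaffected by the length division, which is exactly why edgeR/DESeq2 skip it for per-gene comparisons.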
The problem with standard TPM values is that they don't account for differences in library composition. As an example, let's say we have three genes (A, B, C) that are expressed at equal levels in two samples, but the second sample also has a gene D that is expressed at a high level and is not expressed at all in the first sample. You can think of each read as reaching into a bag and pulling out a transcript, weighted by its expression level. Because gene D is in sample 2's bag at a high frequency but not in sample 1's, we'll sequence fewer reads of A, B, and C from sample 2 than from sample 1, even though they're expressed at the same level in both. TMM normalization attempts to account for this.
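Here is a quick numerical sketch of that bag-sampling picture, with toy numbers of my own choosing. The final rescaling is a crude median-of-ratios stand-in, not real TMM (which lives in edgeR's calcNormFactors), but it shows the same composition correction at work:

```python
import numpy as np

rng = np.random.default_rng(0)

# True expression: A, B, C equal in both samples; D expressed only in sample 2.
expr1 = np.array([10.0, 10.0, 10.0, 0.0])   # sample 1
expr2 = np.array([10.0, 10.0, 10.0, 70.0])  # sample 2: D crowds the "bag"

depth = 1_000_000  # identical sequencing depth for both samples
reads1 = rng.multinomial(depth, expr1 / expr1.sum())
reads2 = rng.multinomial(depth, expr2 / expr2.sum())

# Naive counts-per-million says A, B, C dropped in sample 2 -- they didn't.
cpm1 = reads1 / depth * 1e6
cpm2 = reads2 / depth * 1e6
print(cpm1[:3])  # ~333333 each
print(cpm2[:3])  # ~100000 each, purely a composition artifact

# Crude correction: rescale sample 2 by the median count ratio over genes
# detected in both samples, which lines A, B, C back up.
shared = (reads1 > 0) & (reads2 > 0)
factor = np.median(reads2[shared] / reads1[shared])
print(cpm2[:3] / factor)  # back to ~333333 each
```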
Another problem with normalizations like TPM and RPKM is that they flatten out the raw read counts. Let's say you have a project where one sample has a gene with a TPM of 0.5, and another sample has a TPM of 1.0. Do those mean something like 2 reads versus 4 reads? Or more like 40 reads versus 80 reads? A difference of a single read might very well be due to sampling error, whereas a difference of 20 reads is more likely to be real. But you can't tell what's going on from the TPM alone; you've normalized away an important part of the data.
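A small worked example of that flattening, with made-up library depths and assuming equal gene lengths so that TPM reduces to reads per million:

```python
# Two hypothetical libraries where the same gene lands at the same TPM
# despite very different amounts of evidence behind it.
def simple_tpm(gene_reads, total_reads):
    return gene_reads / total_reads * 1e6

print(simple_tpm(2, 4_000_000))    # 0.5 TPM from only 2 reads
print(simple_tpm(40, 80_000_000))  # 0.5 TPM from 40 reads
```

Both calls print 0.5, so the TPM alone cannot tell you whether the underlying count is large enough to trust.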
Normalizations like TPM or RPM might be all right for visualizations, but for finding differentially expressed genes you should let DESeq2 or edgeR (or whatever you use) do its own normalization.
1) It is advised that you divide by a normalization/size factor, but this should not be a naive per-million factor that corrects only for sequencing depth; it should also take library composition into account. This video illustrates why.
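For reference, here is a simplified sketch of a composition-aware size factor in the style of DESeq2's median-of-ratios method (my own toy re-implementation with invented counts, not the package's actual code):

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Composition-aware size factors, DESeq2-style (simplified).

    counts -- genes x samples matrix of raw counts
    """
    log_counts = np.log(counts)
    # Geometric-mean pseudo-reference; genes with a zero anywhere drop out.
    log_ref = log_counts.mean(axis=1)
    usable = np.isfinite(log_ref)
    # Each sample's factor is its median gene-wise ratio to the reference.
    return np.exp(np.median(log_counts[usable] - log_ref[usable, None], axis=0))

counts = np.array([[100, 200],
                   [ 50, 100],
                   [ 10, 500]])  # last gene changes; the others just double
print(median_of_ratios_size_factors(counts))  # ~[0.71, 1.41]
```

Because the factor is a median over genes, the one gene that genuinely changed does not drag the scaling of all the others along with it, which is exactly what a naive per-million factor gets wrong.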
2) One can divide by gene length, but it gives no advantage unless you want to compare genes within the same sample. For between-sample comparisons you would shrink the counts of longer genes and thereby reduce their statistical power without adding any benefit. Between samples you compare one gene to its counterpart in another sample, so the length is the same and no correction is necessary. What is necessary is to account for which isoforms are expressed. Say that in sample A you mostly see expression of isoforms A and B, which are e.g. 20% longer than isoforms C and D, which dominate in sample B. Even if the true gene expression were the same, sample A would still (probably) get higher counts for that gene, because the longer isoforms accumulate more reads. This needs correction to avoid length bias, which can be done e.g. with the tximport package: it introduces offsets for the linear model that correct for the length bias, as in the sketch below.
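tximport itself is an R package, so the following is only a conceptual Python/statsmodels sketch of how a log-length offset enters a count model; all of the counts, lengths, and sample layout are invented:

```python
import numpy as np
import statsmodels.api as sm

# One gene, six samples, two conditions. Suppose the average effective
# transcript length differs by condition because different isoforms dominate.
counts = np.array([120, 130, 125, 100, 105, 98])             # invented raw counts
eff_length = np.array([1200, 1200, 1200, 1000, 1000, 1000])  # bp, per sample
condition = np.array([0, 0, 0, 1, 1, 1])

X = sm.add_constant(condition)

# Without an offset, the length difference masquerades as an expression change.
naive = sm.GLM(counts, X, family=sm.families.Poisson()).fit()

# With log(effective length) as an offset, counts are modeled per unit length,
# so the condition effect is no longer confounded with isoform length.
adjusted = sm.GLM(counts, X, family=sm.families.Poisson(),
                  offset=np.log(eff_length)).fit()

print(naive.params[1], adjusted.params[1])  # condition log-fold-changes
```

The naive fit reports a sizable negative log-fold-change driven purely by the shorter isoforms in the second condition; the offset fit brings it back close to zero.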