Question

How do you generate TMM normalized counts using EdgeR?



2.9 years ago

Pratik ★ 1.0k

Hi guys,

First, I just want to say that I know this has been asked numerous times and in a number of places. However, my confusion has only grown!

If someone could please set the record straight on how to generate TMM normalized counts using edgeR, I would be incredibly grateful!

This post has two answers that seem to disagree with one another: output TMM normalized counts with edgeR

The main reason I want to generate a TMM matrix from my count matrix is to compare gene expression levels within samples (the levels of multiple cell-type markers within a single sample) in a bar plot, and then to generate the same bar plot for every other sample so that I can compare between samples.

I think using TMM normalized counts will allow me to compare both between samples and within samples, according to this page: https://hbctraining.github.io/Training-modules/planning_successful_rnaseq/lessons/sample_level_QC.html

I plan to use something like this to generate the bar plots: Network/Pathway Analysis from Mass Spec data

Any help would be appreciated.

Thank you in advance :)

R RNA-seq edgeR • 13k views

updated 28 days ago by inedraylig ▴ 60 • written 2.9 years ago by Pratik ★ 1.0k

score 25 · Accepted Answer · 2021-06-13

Sorry this has been confusing. It has also been a regular source of frustration for the edgeR authors, because we have been saying the same thing for a decade:

If you want to export normalized expression values out of edgeR, just use cpm or rpkm.

The root of the confusion is that there is no such thing as a "TMM normalized count" because TMM normalizes the library sizes rather than the counts. And I have always resisted pressure to use the term "normalized count" in the edgeR documentation because a normalized value can no longer be a count. I prefer to use more descriptive and specific terms like cpm or rpkm. I know that other software tools refer to "normalized counts" but I find that unhelpful. Normalized for what?

TMM normalizes the library sizes to produce effective library sizes. cpm values are counts normalized by the effective library sizes. rpkm values are counts normalized by effective library sizes and by gene/feature length.
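To make the three definitions above concrete, here is a minimal base-R sketch of the arithmetic. The toy counts, the normalization factors and the gene lengths are all made-up values for illustration (in practice the factors would come from edgeR, not be typed in by hand), and the variable names are hypothetical.

```r
# Toy raw counts (hypothetical): 3 genes x 2 samples
counts <- matrix(c(10, 20, 70,
                   40, 60, 100),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# Raw library sizes are the column sums of the counts
lib_size <- colSums(counts)                # 100, 200

# Hypothetical TMM normalization factors (assumed here, not computed)
norm_factors <- c(s1 = 0.8, s2 = 1.25)

# Effective library size = library size x normalization factor
eff_lib_size <- lib_size * norm_factors    # 80, 250

# cpm: counts divided by effective library size, scaled per million
cpm_manual <- t(t(counts) / eff_lib_size) * 1e6

# rpkm: cpm additionally divided by gene length in kilobases
gene_length <- c(geneA = 1000, geneB = 2000, geneC = 500)  # hypothetical, in bp
rpkm_manual <- cpm_manual / (gene_length / 1000)
```

Note that the counts themselves are never altered; only the denominator changes. Because rpkm also accounts for feature length, it is the more natural of the two when comparing different genes within the same sample.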

A second source of confusion is that people seem to assume that edgeR must be storing "normalized counts" internally somehow, but it does not. Most edgeR DE pipelines never modify the original counts in any way. Normalization for library size is instead implicit as part of the model-fitting. edgeR does not use cpm or rpkm values internally in its DE pipelines; rather, they are only for export or for graphical purposes.

A third source of confusion is that the original edgeR pipeline (now called the "classic" pipeline) did compute pseudo.counts internally, which are equivalent to the original counts but with equalized effective library sizes. The pseudo.counts were used only to estimate dispersions, not to assess DE or to compute fold-changes. We did not intend or recommend that users would export these as normalized values, but some have done so. In any case, one cannot multiply pseudo.counts by norm.factors, as one of the previous answers you link to suggests.

Examples of posts by the edgeR authors:

score 9 · Accepted Answer · 2021-06-13

The default normalization in edgeR can be broken down to two steps:

1) normalization by library size. That is simply the correction for read depth. This may be good enough when there are no widespread changes in library composition (i.e. samples are very similar and only very few genes are differential), but often it is not. For an example, see my answer here (TMM-Normalization) using GTEx data, where I compare pancreas and lung transcriptomes, so one would expect notably different gene expression profiles. As you'll see, plain per-million scaling results in biased normalized counts, while TMM manages to properly center the bulk of genes at y=0 in the MA-plot.

2) the introduction of normalization factors that correct the library size-scaled values for the compositional component. This is what the Trimmed Mean of M-values (TMM) does. For technical details, see the original paper by Robinson & Oshlack in Genome Biology from 2010.

Points 1) and 2) are then combined into the effective library size (the library size multiplied by the normalization factor). Dividing the raw counts by the effective library size, and scaling per million, gives the normalized values, often referred to as TMM-normalized counts or cpm.

In practice:

#/ load edgeR:
library(edgeR)

#/ make the DGEList:
y <- DGEList(...)

#/ calculate TMM normalization factors:
y <- calcNormFactors(y)

#/ get the normalized counts:
cpms <- cpm(y, log=FALSE)

The cpm function uses the normalization factors internally, provided that calcNormFactors was run on that DGEList. If it was not, the normalization factors are all 1 and cpm just returns the plain per-million scaled values.
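One way to convince yourself of this behaviour is to compare cpm against a manual calculation before and after calcNormFactors. The following is a rough sketch on simulated data; the toy count matrix and the variable names are made up for illustration, and it assumes edgeR is installed.

```r
library(edgeR)

set.seed(1)

# Toy data (hypothetical): 1000 genes x 4 samples of negative binomial counts
counts <- matrix(rnbinom(4000, mu = 50, size = 5), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:4)))

# Give samples 3 and 4 a few very highly expressed genes to shift composition
counts[1:20, 3:4] <- counts[1:20, 3:4] * 100L

y <- DGEList(counts = counts)

# Before calcNormFactors all normalization factors are 1,
# so cpm() is plain per-million scaling by the raw library size
plain <- cpm(y)
manual_plain <- t(t(y$counts) / y$samples$lib.size) * 1e6

# Calculate TMM normalization factors (TMM is the default method)
y <- calcNormFactors(y)

# After calcNormFactors cpm() divides by the effective library size
eff_lib_size <- y$samples$lib.size * y$samples$norm.factors
tmm <- cpm(y)
manual_tmm <- t(t(y$counts) / eff_lib_size) * 1e6

# Both comparisons should return TRUE
all.equal(plain, manual_plain, check.attributes = FALSE)
all.equal(tmm, manual_tmm, check.attributes = FALSE)
```

The raw counts in y$counts are identical before and after calcNormFactors; only y$samples$norm.factors changes.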