Question

TMM-Normalization

1

Entering edit mode

3.0 years ago

Filago ▴ 90

Hi, I want to calculate tissue specificity (e.g. z-scores) between GTEx tissues. I downloaded the raw counts, performed TMM normalization with edgeR and obtained raw counts. Next step in the pipeline would be to normalize to cpm counts. However, for some tissues like e.g. pancreas this strategy seems weird, as those tissues have a very small number of genes dramatically overexpressed (reducing CPM/TPM of other genes in those organs (dramatically)). An alternative that I am thinking about is to simply adjust my raw counts by normalized library sizes and go on with analysis.

What do you think about it?

Best, Andreas

TMM RNAseq-Normalization • 5.2k views

ADD COMMENT • link 2.5 years ago by Filago ▴ 90

score 11 · Answer 1 · 2021-04-19

I see no reason to not use edgeR here. The point of TMM is exactly to avoid skewed normalization due to differences in composition. The process of edgeR::calcNormFactors followed by edgeR::cpm is to first calculate factors that correct for composition, combine them with the total library sizes into effective library sizes, and then apply the derived size factors to the raw counts. Be sure to first make a DGEList object:

y <- DGEList(your.counts)
y <- calcNormFactors(y)
cpm(y)

...as instructed in the manual. This will make sure the norm factors are being used properly.

The point of reduced counts that you mention is only relevant if you used a naive per-million scaling. The norm factors as said above will further correct the library size-scaled counts to correct for that.

But there is never a guarantee the method really captures differences if they are globally different between groups. You can diagnose this using MA-plots to see whether the bulk of genes gets somewhat properly centered at y=0. You could first average all samples of one tissue and then plot some tissues against each other on the MAs to see how the normalization performed.

These concerns are not theoretical, lets check it on the GTEx data. Code for the plot is below, here I obtained the raw counts from recount, and then looked at Lung vs Pancreas, either normalizing only by library size (here called naive) and by TMM.

MAplots

As you can see the naive method fails to properly center the bulk of genes (the very blue ones, this is a density plot, so very blue means lots of data points points) whereas TMM does the trick and corrects for the biased library composition. The top and bottom dashed lines indicate a fold change of 2 (on log2 scale), clearly the naive method suffers from a compositional bias which is not being accounted for.

Code: