Question

TMM normalization factors in RNA-seq analysis

2

6.7 years ago

sarahmanderni ▴ 100

Hi,

To my understanding, the main aim of TMM normalization is to account for library size variation between the samples of interest. I have simulated RNA-seq data with equal library sizes for all samples. I ran TMM normalization and expected all normalization factors (from the calcNormFactors() function) to be equal to one. However, the factors vary from 0.4 to 2.4 (with a median of 1, of course), which is not what I expected. Have I misunderstood something here? My other question: can I use TMM normalization on non-binomial values, for instance TPM values?

Thanks in advance!

RNA-Seq TMM normalization • 27k views

ADD COMMENT • link updated 6.7 years ago by h.mon 35k • written 6.7 years ago by sarahmanderni ▴ 100

2

Exactly how did you simulate the data? TMM is a robust measure, so if you produced very different distributions of reads, that would be the cause.

ADD REPLY • link 6.7 years ago by Devon Ryan 104k

0

To my knowledge, TMM is supposed to correct mostly for composition bias (as well as library size). If you generated samples with different compositions, then it is expected that the normalisation factors vary.
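A minimal sketch of what composition bias looks like when library sizes are equal. The two samples and gene names (`g1` to `g5`) are hypothetical values made up for illustration, not data from this thread. Genes `g1` to `g4` are assumed to have the same true abundance in both samples, while `g5` is strongly up in sample B and therefore takes a larger share of the fixed number of reads.

```python
# Hypothetical counts: both samples have exactly 1,000,000 reads.
# g1..g4 are assumed unchanged in true abundance; g5 is strongly up in B,
# so it takes reads away from the other genes.
sample_a = {"g1": 200_000, "g2": 200_000, "g3": 200_000, "g4": 200_000, "g5": 200_000}
sample_b = {"g1": 100_000, "g2": 100_000, "g3": 100_000, "g4": 100_000, "g5": 600_000}


def cpm(counts):
    """Counts per million, scaling only by total library size."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


cpm_a = cpm(sample_a)
cpm_b = cpm(sample_b)

# Library sizes are identical, yet the unchanged genes look halved in B.
for gene in sample_a:
    print(gene, cpm_b[gene] / cpm_a[gene])
```

Scaling by library size alone cannot remove this effect, which is why the normalization factors can differ from one even when every sample has the same total count.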

ADD REPLY • link 6.7 years ago by James Ashmore ★ 3.4k

0

I have not produced the data myself, but yes, the distribution of the reads varies significantly. Can you elaborate a little on what you mean by a robust measure, and in what way the distribution affects it?

ADD REPLY • link 6.7 years ago by sarahmanderni ▴ 100

score 6 · Answer 1 · 2017-08-29

6

6.7 years ago

h.mon 35k

The main aim of TMM normalization is to account for library size variation between the samples of interest, while allowing for the fact that some extremely differentially expressed genes would otherwise distort the normalization procedure. Or, as Devon Ryan said, it is a robust normalization. How does it achieve its robustness? From the paper:

A trimmed mean is the average after removing the upper and lower x% of the data.

So an assumption of TMM is that the majority of the genes are not differentially expressed. And, as Devon pointed out, different distributions of gene expression will result in different TMM normalization factors.
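To make the trimming idea concrete, here is a simplified sketch rather than edgeR's actual implementation. It takes an unweighted trimmed mean of per-gene log2 ratios between a sample and a reference. The real calcNormFactors() additionally trims on absolute expression, weights genes by precision, and rescales the factors across samples, all of which are omitted here. The function names and the counts are hypothetical.

```python
import math


def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def simple_tmm_factor(sample, reference, trim_fraction=0.3):
    """Simplified, unweighted TMM-like factor for `sample` against `reference`."""
    n_sample, n_reference = sum(sample), sum(reference)
    # M-values: log2 ratio of library-size-scaled counts, gene by gene.
    m_values = [
        math.log2((s / n_sample) / (r / n_reference))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    ]
    return 2 ** trimmed_mean(m_values, trim_fraction)


# Hypothetical data: equal library sizes (1000 reads each), but one gene
# dominates the sample and pushes every other gene down.
reference = [100] * 10
sample = [50] * 9 + [550]

robust_factor = simple_tmm_factor(sample, reference)            # trims the outlier
naive_factor = simple_tmm_factor(sample, reference, trim_fraction=0.0)
print(robust_factor, naive_factor)
```

With trimming, the single dominant gene is discarded and the factor reflects the nine genes that agree with each other. Without trimming, that one gene pulls the estimate towards one. Either way the factor is not one, even though both library sizes are identical.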

ADD COMMENT • link 6.5 years ago by h.mon 35k

0

Makes sense. Will check the paper again, thanks.

ADD REPLY • link 6.7 years ago by sarahmanderni ▴ 100

0

Do you have experience applying it to TPM values?

ADD REPLY • link 6.7 years ago by sarahmanderni ▴ 100

1

I have none, but it seems you can do it (yes, you can).

ADD REPLY • link 6.7 years ago by h.mon 35k

0

I am also confused about normalization and the statistics behind DE programs. I am using edgeR to analyze two conditions.

Example raw counts for one gene, with four replicates per condition, control (C) and treatment (T):

gene = FBgn0034710

Controls = 820 1618 1728 1007

Treatments = 7195 1252 1312 1291

Result from edgeR:

logFC = 1.10
logCPM = 6.5
LR = 9.77
PValue = 0.0017
FDR = 0.02

Why is the FBgn0034710 gene statistically significant when one replicate (7195) has a much higher raw count than the others? I know that library size could be a factor, but it is similar across the replicates.
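As a rough check on the counts above, the arithmetic below compares the group means with and without the high replicate. It is only a sketch on raw counts: it ignores library sizes, normalization factors, and the dispersion modelling that edgeR actually performs, so it does not reproduce edgeR's test.

```python
import math

# Raw counts for FBgn0034710 as quoted in the post.
control = [820, 1618, 1728, 1007]
treatment = [7195, 1252, 1312, 1291]


def mean(values):
    return sum(values) / len(values)


# log2 ratio of the raw group means, using all four treatment replicates.
log2_fc_all = math.log2(mean(treatment) / mean(control))

# Same ratio after leaving out the first treatment replicate (7195).
log2_fc_without_high = math.log2(mean(treatment[1:]) / mean(control))

print(log2_fc_all, log2_fc_without_high)
```

The ratio of raw means comes out at about 1.09, close to the reported logFC, while leaving out the 7195 replicate brings it to roughly zero. On these raw counts, the estimated fold change rests almost entirely on that one replicate.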

ADD REPLY • link 6.5 years ago by vm.higareda ▴ 30