Question: TMM normalization factors in RNA-seq analysis
sarahmanderni wrote, 21 months ago:

Hi,

To my understanding, the main aim of TMM normalization is to account for library size variation between the samples of interest. I have simulated RNA-seq data with equal library sizes for all samples. I ran TMM normalization and expected all normalization factors (from the calcNormFactors() function) to be equal to one. However, the factors vary from 0.4 to 2.4 (with a median of 1, of course), which is not what I expected. Have I misunderstood something here? Another question: can I use TMM normalization on non-binomial values, for instance on TPM values?

Thanks in advance!

rna-seq tmm normalization • 8.0k views
modified 21 months ago by h.mon • written 21 months ago by sarahmanderni

Exactly how did you simulate the data? TMM is a robust measure, so if you produced very different distributions of reads, that would be the cause.

written 21 months ago by Devon Ryan

To my knowledge, TMM is supposed to correct mostly for composition bias (as well as library size). If you generated samples with different compositions, then it is expected that the normalisation factors vary.
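
To illustrate the point about composition, here is a minimal sketch in plain Python. It is not edgeR's implementation: `simple_tmm_factor` is a hypothetical helper that takes an unweighted trimmed mean of per-gene log2 ratios, whereas calcNormFactors() also trims on absolute expression, applies precision weights, and rescales the factors so that they multiply to one. The counts are invented for illustration. Both libraries have exactly 1000 reads, yet the factor for the skewed sample is not 1.

```python
import math


def simple_tmm_factor(sample, ref, trim=0.3):
    """Unweighted trimmed mean of log2 ratios (a simplification of TMM)."""
    n_sample, n_ref = sum(sample), sum(ref)
    # Per-gene log2 ratio of relative abundances (the "M values").
    m_values = sorted(
        math.log2((s / n_sample) / (r / n_ref))
        for s, r in zip(sample, ref)
        if s > 0 and r > 0
    )
    # Drop the lowest and highest `trim` fraction of M values.
    k = int(len(m_values) * trim)
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))


# Invented counts: both libraries sum to 1000 reads.
reference = [100] * 10
skewed = [50] * 8 + [300] * 2  # two genes take up 60% of the reads

factor_same = simple_tmm_factor(reference, reference)
factor_skewed = simple_tmm_factor(skewed, reference)

print(sum(reference), sum(skewed))  # equal library sizes
print(factor_same, factor_skewed)
```

In the skewed sample the two dominant genes push every other gene down to half its relative abundance, so the trimmed mean of the log ratios is -1 and the sketch returns a factor of 0.5 despite the identical library sizes.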

written 21 months ago by James Ashmore

I have not produced the data myself, but yes, the distribution of the reads varies significantly. Can you elaborate a little on what you mean by a robust measure, and in what way the distribution affects the result?

written 21 months ago by sarahmanderni

h.mon (Brazil) wrote, 21 months ago:

The main aim of TMM normalization is to account for library size variation between the samples of interest, while taking into account that a few extremely differentially expressed genes would otherwise distort the normalization. As Devon Ryan said, it is a robust normalization. How does it achieve its robustness? From the paper:

A trimmed mean is the average after removing the upper and lower x% of the data.

So an assumption of TMM is that the majority of the genes are not differentially expressed. And as Devon pointed out, different distributions of gene expression will result in different TMM normalization factors.
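
The robustness of a trimmed mean can be sketched in a few lines of plain Python. The `trimmed_mean` helper and the log ratios below are made up for illustration and are not taken from the paper or from edgeR. Eight genes are unchanged (log ratio 0) and two are strongly differentially expressed.

```python
def trimmed_mean(values, fraction):
    """Mean after removing the lowest and highest `fraction` of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Invented log2 ratios: most genes unchanged, two strongly up-regulated.
log_ratios = [0.0] * 8 + [5.0, 6.0]

plain = sum(log_ratios) / len(log_ratios)  # pulled up by the two outliers
robust = trimmed_mean(log_ratios, 0.2)     # outliers are trimmed away

print(plain, robust)
```

The plain mean is dragged to about 1.1 by the two extreme genes, while the trimmed mean stays at 0. This only works as long as the extreme genes fit inside the trimmed fraction, which is why the method relies on most genes not being differentially expressed.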

modified 19 months ago • written 21 months ago by h.mon

Makes sense. Will check the paper again, thanks.

written 21 months ago by sarahmanderni

Do you have any experience applying it to TPM values?

written 21 months ago by sarahmanderni

I have none, but it seems you can do it (yes, you can).

written 21 months ago by h.mon

I am also confused about normalization and the statistics behind DE programs. I am using edgeR to analyze two conditions.

Example of raw counts for one gene, with four replicates per condition, control (C) and treatment (T):

gene = FBgn0034710

Controls = 820 1618 1728 1007

Treatments = 7195 1252 1312 1291

Result of edgeR:

logFC = 1.10
logCPM = 6.5
LR = 9.77
PValue = 0.0017
FDR = 0.02

Why is the FBgn0034710 gene statistically significant when only one replicate (7195) has a much higher raw count than the others? I know that library size could be a factor, but it is similar in the other replicates.
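
As a rough arithmetic check of these numbers, the sketch below computes the log2 ratio of the group means in plain Python. It assumes equal effective library sizes for all eight samples, which is not stated exactly in the post, and it is not how edgeR fits its model (edgeR uses a negative binomial model with estimated dispersions). It only shows where a fold change of this size can come from.

```python
import math

# Raw counts for FBgn0034710 as given in the post.
controls = [820, 1618, 1728, 1007]
treatments = [7195, 1252, 1312, 1291]


def mean(values):
    return sum(values) / len(values)


# Log2 ratio of group means, assuming equal effective library sizes.
log_fc_all = math.log2(mean(treatments) / mean(controls))

# Same ratio with the extreme treatment replicate (7195) left out.
treatments_no_outlier = [x for x in treatments if x != 7195]
log_fc_no_outlier = math.log2(mean(treatments_no_outlier) / mean(controls))

print(round(log_fc_all, 2), round(log_fc_no_outlier, 2))
```

Under that assumption, the ratio of means with all replicates is close to the reported logFC of 1.10, and it falls to roughly zero once the 7195 replicate is excluded, so the estimated fold change appears to be driven almost entirely by that single replicate.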

modified 19 months ago • written 19 months ago by vm.higareda