TMM normalisation on a subset of the data
5.7 years ago


I'm working on an RNA-Seq dataset and would like to use the TMM normalization from the edgeR package to normalize the data. I have read the manual and also the paper here.

I have two questions regarding the TMM normalization.

First, in our data we are mostly interested in specific regions on the chromosomes. For that reason we extracted these regions from the complete mapped BAM files using samtools. Does it make a difference for the TMM normalization if I take only the extracted regions into account when normalizing the data, rather than the whole library?

I know that the values I get at the end will differ because a different number of reads maps to the regions of interest. But all in all, can I apply the TMM normalization to the extracted subset of the data alone?

Second, can someone please explain the main difference between the scaling method of normalization and normalization by library size?

I don't think I really got it from the paper.




edgeR TMM normalization
5.7 years ago

Normalizing on the subset data should be fine unless you expect a very large percentage of that subset to be differentially expressed. If you expect that, then normalizing on the subset will completely screw up the results (the differences may disappear due to the normalization).
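To make the subset caveat concrete, here is a minimal sketch in plain Python. It is not edgeR's `calcNormFactors`: it uses an unweighted trimmed mean of log-ratios and leaves out the precision weights, the trimming on absolute expression, and the rescaling of factors that edgeR applies. The counts are made-up toy data, and the names `trimmed_factor` and `fold_change` are hypothetical helpers introduced only for this illustration.

```python
import math

def trimmed_factor(sample, reference, trim=0.3):
    """Simplified TMM-style scaling factor (unweighted; an assumption, not edgeR's code)."""
    n_s, n_r = sum(sample), sum(reference)
    # Per-gene log2 ratios of library-size-scaled counts ("M values").
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    k = int(len(m_values) * trim)          # number of values trimmed from each end
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))

def fold_change(gene, sample, reference, factor):
    """Sample/reference ratio for one gene after scaling the sample's library size."""
    effective_size = sum(sample) * factor
    return (sample[gene] / effective_size) / (reference[gene] / sum(reference))

# Toy data: 100 genes, of which the first 8 are truly 2-fold up in the sample.
reference_all = [100] * 100
sample_all = [200] * 8 + [100] * 92

# Region of interest: the first 10 genes, so 8 of the 10 are differentially expressed.
subset = range(10)
reference_sub = [reference_all[i] for i in subset]
sample_sub = [sample_all[i] for i in subset]

fc_whole = fold_change(0, sample_all, reference_all,
                       trimmed_factor(sample_all, reference_all))
fc_subset = fold_change(0, sample_sub, reference_sub,
                        trimmed_factor(sample_sub, reference_sub))

print("fold change of gene 0, factor from whole library:", round(fc_whole, 3))
print("fold change of gene 0, factor from subset only:  ", round(fc_subset, 3))
```

In this toy setup the factor computed on the whole library leaves the 2-fold change of gene 0 intact, while the factor computed on the subset treats the changed genes as the baseline and shrinks that change to roughly 1.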

Regarding TMM vs library size normalization, I have yet to see any case where library size normalization is appropriate. This method is known to not be robust and will produce completely crap results if you have a few highly expressed genes changing in expression. The whole point of TMM (and the similar method in DESeq2) is to normalize in a robust manner, by removing the undue influence of a few highly expressed genes.
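A small sketch of that robustness argument, again in plain Python and again only a simplified stand-in for TMM (unweighted trimmed mean of log-ratios, without edgeR's precision weights, expression-level trimming, or factor rescaling). The counts are invented, and `trimmed_factor` is a hypothetical helper name. Both libraries have the same depth, but one highly expressed gene takes up more than half of the sample's reads.

```python
import math

def trimmed_factor(sample, reference, trim=0.3):
    """Simplified TMM-style scaling factor (unweighted; an assumption, not edgeR's code)."""
    n_s, n_r = sum(sample), sum(reference)
    # Per-gene log2 ratios of library-size-scaled counts ("M values").
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    k = int(len(m_values) * trim)          # number of values trimmed from each end
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))

# Toy data: 10 genes, both libraries sequenced to 1000 reads.
# Gene 0 is strongly up in the sample and soaks up reads, so the nine
# unchanged genes each receive only half as many reads as in the reference.
reference = [100] * 10
sample = [550] + [50] * 9

# Library size normalization: divide each count by its library total.
naive_ratio = (sample[1] / sum(sample)) / (reference[1] / sum(reference))

# Robust scaling: shrink the sample's effective library size by the trimmed factor.
factor = trimmed_factor(sample, reference)
effective_size = sum(sample) * factor
tmm_ratio = (sample[1] / effective_size) / (reference[1] / sum(reference))

print("scaling factor:", factor)
print("unchanged gene, library size normalization:", naive_ratio)
print("unchanged gene, trimmed-mean scaling:", tmm_ratio)
```

With library size normalization the nine unchanged genes look two-fold down purely because gene 0 takes up so many reads. The trimmed factor discards gene 0's extreme log-ratio, comes out at 0.5, and brings the unchanged genes back to a ratio of 1.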


Hi Devon, and thanks for the fast response. Do I understand it correctly if I say (and I sort of quote the paper here) that the TMM normalisation computes the proportion of each gene's reads relative to the total number of reads in the library and compares that across all samples?

And what about the other way around? What if a large number of genes that are supposed to be differentially expressed are not in this subset of interest? Will it then skew the results in an unwanted way?

