Question

How to normalized TPM with TMM method?

4

Entering edit mode

4.8 years ago

xiaoguang ▴ 140

We can use edgeR to normalized raw read counts with TMM method, but how can we normalized TPM with the same method?

rna-seq edgeR TPM TMM • 11k views

ADD COMMENT • link updated 2.1 years ago by DS ▴ 10 • written 4.8 years ago by xiaoguang ▴ 140

score 16 · Accepted Answer · 2019-08-15

16

Entering edit mode

4.7 years ago

Gordon Smyth ★ 7.0k

I don't quite follow your question. TMM is a method for normalizing the library sizes rather than a method for normalizing read counts. As the edgeR User's Guide says (page 15):

normalization in edgeR is model-based, and the original read counts are not themselves transformed.

Which way around is your question? Do you have TPMs and want to compute TMM factors or do you have TMM factors and want to compute TPMs?

If you are asking the first question, then no, TMM factors can only be computed from the raw counts, not from quantities such as TPMs or CPMs from which the library sizes have already been divided out. If you already have TPMs from some software package, then normalization has almost certainly already been applied, so I would be very wary about trying to re-normalize them unless you really know what you're doing.

If you are asking the second question then, yes, TMM factors can in principle be used to compute TPMs. In edgeR, any downstream quantity that is computed from the library sizes will incorporate the TMM factors automatically, because the factors are considered part of the effective library sizes. TMM normalization factors will be applied automatically when you use

CPM <- cpm(dge)

or

RPKM <- rpkm(dge)

in edgeR to compute CPMs or RPKMs from a DGEList object. I don't necessarily recommend TPM values myself, but if you go on to compute TPMs by

TPM <- t( t(RPKM) / colSums(RPKM) ) * 1e6

then the TMM factors will naturally have been incorporated into the computation.

ADD COMMENT • link 4.7 years ago by Gordon Smyth ★ 7.0k

1

Entering edit mode

Dear Gordon Smyth - would this be the correct way to calculate TMM-normalized logTPM incorporating edgeR's use of prior.count (assuming it's better than doing log2(TPM + 1))? And by the way not meaning to use this for DGE but for analyses where having both inter- and intra-sample normalization performed is important.

dge <- DGEList(counts=counts, genes=data.frame(Length=lengths)
dge <- dge[filterByExpr(dge), , keep.lib.sizes=FALSE]
dge <- calcNormFactors(dge)

prior_count <- 2

lib_size <- dge$samples$lib.size * dge$samples$norm.factors
scaled_prior_count <- prior_count * lib_size / mean(lib_size)
adj_lib_size <- lib_size + 2 * scaled_prior_count

fpkm <- t((t(dge$counts) + scaled_prior_count) / adj_lib_size) * 1e6 / dge$genes$Length * 1e3
tpm <- t(t(fpkm) / colSums(fpkm)) * 1e6
log_tpm <- log2(tpm)

For those who want to test this, you can verify the above fpkm code is correct with your data:

all.equal(log2(fpkm), rpkm(dge, log=TRUE, prior.count=prior_count))

[1] TRUE

ADD REPLY • link 2.1 years ago by hermidalc ▴ 60

0

Entering edit mode

Hi, thanks for this information. Could you please help me understand the first step?

dge <- DGEList(counts=counts, genes=data.frame(Length=lengths)

I assume counts has row names the gene_id, followed by each individual as a column, but how about the gene length on the genes argument?

ADD REPLY • link 2.1 years ago by DS ▴ 10

0

Entering edit mode

The non-overlapping exon length of each gene, easily calculated with a couple lines of code using GenomicFeatures package

Extract Total Non-Overlapping Exon Length Per Gene With Bioconductor

ADD REPLY • link 2.1 years ago by hermidalc ▴ 60

0

Entering edit mode

thanks for the rapid response, I have already extracted gene_length and GC content from bioMart in this format:

r_xtract[1:3,]

	length	gc
ENSG00000000003	4536	0.3992504
ENSG00000000005	1476	0.4241192
ENSG00000000419	9276	0.4252911

but when I try to input as:

dge <- DGEList(counts=unrom_counts, genes=data.frame(r_xtract$length))

$ operator is invalid for atomic vectors

then I remove the gc column

test=r_xtract[,-2]

test[1:3]

ENSG00000000003	ENSG00000000005	ENSG00000000419
4536	1476	9276

dge <- DGEList(counts=unrom_counts, genes=data.frame(test))

Error: Counts and genes have different numbers of rows

So the question is purely technical, what format should have the gene argument to match the counts?

genes=data.frame(Length=lengths)

Thanks again! :)

ADD REPLY • link 2.1 years ago by DS ▴ 10

0

Entering edit mode

You should make sure the rows of your count matrix (gene rows x sample cols) match the gene lengths vector…

ADD REPLY • link 2.1 years ago by hermidalc ▴ 60

0

Entering edit mode

you're a boss! thanks so much!

ADD REPLY • link 2.1 years ago by DS ▴ 10

0

Entering edit mode

Would that not be an idea if you wanted to do a inter-library normalization of of the TPM values? Kinda like what's happing under the hood of edgeR::cpm() which allows the CPM values to be normalized according to the TMM normalization factors - just for TPM values?

ADD REPLY • link 4.7 years ago by Kristoffer Vitting-Seerup ★ 4.0k

1

Entering edit mode

TMM normalizes library sizes and can only be applied to counts. Any downstream quantity such as CPMs or TPMs that are computed from library sizes will obviously incorporate the TMM normalization, but that is not the same thing as trying to estimate the TMM factors from the CPMs or TPMs.

In my opinion, normalization should be done prior to the computation of TPMs but, even if you wanted to re-normalize TPMs for some reason, one would not input them to the calcNormFactors function in edgeR, which is how I interpreted OP's question.

Or perhaps I have mis-interpreted the question. Perhaps you and the OP are actually asking a simpler question about whether TMM-normalized library sizes can be used to compute TPMs. That is of course true, and I have edited my answer now to say that.

ADD REPLY • link 4.7 years ago by Gordon Smyth ★ 7.0k

0

Entering edit mode

But is the main goal of the inter-library normalisation not simply to make sure the expression/abundance distributions from multiple samples are comparable? I agree that it would be a nicer approach to re-calcuate the TPM values from the TMM normalized counts. But if we ignore that possibility for a moment I'm not sure I see why inter-library normalisation of TPM values are a bad idea - would it not just produce abundance estimates that are more comparable across conditions? Or does TMM normalisation itself assumes something about the data which does not hold true for TPM values?

ADD REPLY • link 4.7 years ago by Kristoffer Vitting-Seerup ★ 4.0k

0

Entering edit mode

Just out of curiosity I tried it and it does look like there is some improvement on a dataset I already had open as can be seen here. Naturally this is not a "bad" dataset which needs it - but sometimes it could be needed (as TPM normalization does not account for e.g. different levels of rRNA contamination).

ADD REPLY • link 4.7 years ago by Kristoffer Vitting-Seerup ★ 4.0k

0

Entering edit mode

I don't know what you are referring to when you say that you "tried it", or what type of data you have applied it to, or what you mean by "some improvement". I am not going to continue this discussion.

I have tried to answer OP's question as completely as I can. I have nothing to add to what I have already written.

ADD REPLY • link 4.7 years ago by Gordon Smyth ★ 7.0k