Question

TMM normalized TPM

0

Entering edit mode

3.2 years ago

firestar ★ 1.6k

I have raw counts and gene lengths. Any tool that can compute TMM normalised gene length corrected data? I want to compare expression of one gene to another gene. Standard TPM doesn't seem to take compositional bias into account?

Would something like this in DESeq2 work?

x <- raw_counts/gene_length

library(DESeq2)
d <- DESeqDataSetFromMatrix(countData=x,colData=m,design=~Sample)
d <- DESeq2::estimateSizeFactors(d,type="ratio")
y <- counts(d,normalized=TRUE)

And y would be GeTMM.

RNA-Seq • 2.5k views

ADD COMMENT • link 3.2 years ago by firestar ★ 1.6k

0

Entering edit mode

I think so, although I would do the opposite since DEseq2 expects raw counts as input (first do the TMM normalization, then scale to gene length). Not sure whether it will change the results much, but at least it should prevent DESeq2 from throwing warnings.

ADD REPLY • link 3.2 years ago by Carlo Yague 8.6k

0

Entering edit mode

I tried to do a reproducible example to go from raw counts to TMM normalized gene length corrected TPM counts.

Download sample data file GSE60450_Lactation-GenewiseCounts.txt from https://figshare.com/s/1d788fd384d33e913a2a.

raw_data <- read.delim("GSE60450_Lactation-GenewiseCounts.txt",header=T)
colnames(raw_data) <- c("gene","length",LETTERS[1:12])
head(raw_data)

##        gene length   A   B   C   D   E   F   G   H   I   J
## 1    497097   3634 438 300  65 237 354 287   0   0   0   0
## 2 100503874   3259   1   0   1   1   0   4   0   0   0   0
## 3 100038431   1634   0   0   0   0   0   0   0   0   0   0
## 4     19888   9747   1   1   0   0   0   0  10   3  10   2
## 5     20671   3130 106 182  82 105  43  82  16  25  18   8
## 6     27395   4203 309 234 337 300 290 270 560 464 489 328
##     K   L
## 1   0   0
## 2   0   0
## 3   0   0
## 4   0   0
## 5   3  10
## 6 307 342

Take count data to a new object. Add genes as rownames.

cr <- raw_data[,3:14]
rownames(cr) <- raw_data$gene
cr[1:5,1:5]

##             A   B  C   D   E
## 497097    438 300 65 237 354
## 100503874   1   0  1   1   0
## 100038431   0   0  0   0   0
## 19888       1   1  0   0   0
## 20671     106 182 82 105  43

Create DESeq2 object.

d <- DESeqDataSetFromMatrix(countData=cr,colData=data.frame(Sample=factor(colnames(cr))),design=~Sample)
d <- DESeq2::estimateSizeFactors(d,type="ratio")

Then the gene length file is prepared. I am really not sure about this part. DESeq2 needs an assay of gene lengths I think. So I prepared a dataframe equal in dimensions to the input count dataframe.

# prepare gene length data
gl <- matrix(rep(raw_data$length,length(cr)),byrow=F,nrow=nrow(cr))
colnames(gl) <- colnames(cr)
rownames(gl) <- rownames(cr)
head(gl)

             A    B    C    D    E    F    G    H    I    J    K    L
497097    3634 3634 3634 3634 3634 3634 3634 3634 3634 3634 3634 3634
100503874 3259 3259 3259 3259 3259 3259 3259 3259 3259 3259 3259 3259
100038431 1634 1634 1634 1634 1634 1634 1634 1634 1634 1634 1634 1634
19888     9747 9747 9747 9747 9747 9747 9747 9747 9747 9747 9747 9747
20671     3130 3130 3130 3130 3130 3130 3130 3130 3130 3130 3130 3130
27395     4203 4203 4203 4203 4203 4203 4203 4203 4203 4203 4203 4203

Add gene lengths to DESeq2 object.

assays(d)$avgTxLength <- gl

Now, FPM and FPKM can be computed.

fpm(d,robust=TRUE)[1:5,1:5]

##                     A           B          C           D
## 497097    18.85684388 13.77543859 2.69700983 10.45648006
## 100503874  0.04305215  0.00000000 0.04149246  0.04412017
## 100038431  0.00000000  0.00000000 0.00000000  0.00000000
## 19888      0.04305215  0.04591813 0.00000000  0.00000000
## 20671      4.56352843  8.35709941 3.40238163  4.63261775
##                   E
## 497097    16.442685
## 100503874  0.000000
## 100038431  0.000000
## 19888      0.000000
## 20671      1.997275

fpkm(d,robust=TRUE)[1:5,1:5]

##                     A           B          C          D
## 497097    4.165175871 3.092204540 0.66005325 2.70713930
## 100503874 0.010603758 0.000000000 0.01132312 0.01273687
## 100038431 0.000000000 0.000000000 0.00000000 0.00000000
## 19888     0.003545465 0.003842916 0.00000000 0.00000000
## 20671     1.170322849 2.178005299 0.96676308 1.39249019
##                   E
## 497097    4.4166109
## 100503874 0.0000000
## 100038431 0.0000000
## 19888     0.0000000
## 20671     0.6228664

For TPM, calculate FPKM, then divide by sum of column and multiply by 10⁶.

counts_fpkm <- fpkm(d,robust=TRUE)
counts_tpm <- (counts_fpkm/colSums(counts_fpkm))*10^6
counts_tpm[1:5,1:5]

##                      A           B          C           D
## 497097    11.753134666 2.084051870 0.42981212 4.059786074
## 100503874  0.029089363 0.000000000 0.00763144 0.008293971
## 100038431  0.000000000 0.000000000 0.00000000 0.000000000
## 19888      0.008168239 0.009407778 0.00000000 0.000000000
## 20671      2.718478619 5.017809428 2.36671617 3.820028160
##                  E
## 497097    7.522179
## 100503874 0.000000
## 100038431 0.000000
## 19888     0.000000
## 20671     1.757581

And now we compute sum of columns for TPM.

colSums(counts_tpm)

##         A         B         C         D         E         F 
##  740336.2  754730.7  851205.2  901837.7  913654.5  831364.9 
##         G         H         I         J         K         L 
##  754482.1  765813.5 1275904.8 1401345.7 2865426.6 2598466.1

Doesn't add up to 1M.

ADD REPLY • link 3.2 years ago by firestar ★ 1.6k

1

Entering edit mode

use t to transpose the matrix before dividing by the sum of columns and transpose it back:

counts_tpm <- t(t(counts_fpkm)/colSums(counts_fpkm))*10^6
colSums(counts_tpm)
##     A     B     C     D     E     F     G     H     I     J     K     L 
## 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06

ADD REPLY • link 3.2 years ago by Carlo Yague 8.6k

0

Entering edit mode

Yes! That works! Thanks!

ADD REPLY • link 3.2 years ago by firestar ★ 1.6k

score 0 · Answer 1 · 2021-02-19

0

Entering edit mode

3.2 years ago

ATpoint 81k

Simply use the DESeq2::fpkm function. It does fpkm but robustified with its sizefactor-based normalization approach. That should be all you need I guess. Alternatively use edgeR, so make a DGEList, then run calcNormFactors and then edgeR::rpkm, that is from the concept the same as DESeq2::fpkm.

ADD COMMENT • link 3.2 years ago by ATpoint 81k

0

Entering edit mode

Do all columns sum up to 1 million in this case?

ADD REPLY • link 3.2 years ago by firestar ★ 1.6k

1

Entering edit mode

No. But you can convert FPKMs to TPMs with a simple formula. See: fpkm to tpm conversion

ADD REPLY • link 3.2 years ago by igor 13k

0

Entering edit mode

That's brilliant! Thanks!

ADD REPLY • link 3.2 years ago by firestar ★ 1.6k

0

Entering edit mode

I do not know, probably not.

ADD REPLY • link 3.2 years ago by ATpoint 81k

0

Entering edit mode

Experts, correct me if I am wrong. I read that, one of the advantages of TPM is that columns sum to 1M so that the proportion of a gene is comparable sample to sample. And FPKM doesn't do this. So is it advisable to use FPKM?

ADD REPLY • link 3.2 years ago by firestar ★ 1.6k

1

Entering edit mode

If they are "robust" TPMs, they will not add up to 1M, but will be 1M on average.

ADD REPLY • link 3.2 years ago by igor 13k

1

Entering edit mode

Yup, and the FPKM to TPM conversion won't get you robust TPMs -- to get robust TPMs, it's advisable to first calculate your TPMs (which sum up to 1E6), then apply the size factor normalization on the TPMs. This is what's done in a few differential gene expression packages (e.g. sleuth).

ADD REPLY • link 3.2 years ago by dsull ★ 5.8k