Convert FPKM to TPM in R
3
2
18 months ago
JACKY ▴ 140

I'm conducting a meta-analysis over several datasets. I want to combine those datasets and run some machine learning algorithms to predict a target response. Some of those datasets are raw counts, which I can easily convert to TPM with the following code:

rpkm <- apply(X = subset(counts_data),
              MARGIN = 2,
              FUN = function(x) {
                10^9 * x / genelength / sum(as.numeric(x))
              })

TPM <- apply(rpkm, 2, function(x) x / sum(as.numeric(x)) * 10^6) %>% as.data.frame()


And some datasets provide RPKM data, which I can also convert to TPM like this:

TPM <- apply(RPKM, 2, function(x) x / sum(as.numeric(x)) * 10^6) %>% as.data.frame()


Some datasets, however, only provide FPKM data. This is problematic: I need all datasets to be TPM normalized, and I'm not familiar with converting FPKM to TPM.

Is it possible to convert FPKM reads to TPM? I found this approach: TPM = FPKM * X where X = 1e6/[sum of all FPKM of a sample].

I'm not sure if I'm allowed to do this, and I don't want to use it and get misleading results. What do you guys think? If I can use it, what is the code in R?

Note: the datasets that provide RPKM or FPKM have no raw data or counts data.

r TPM normalization meta-analysis • 3.8k views
4
5 months ago
DareDevil ★ 4.3k

TPM(i) = ( FPKM(i) / sum ( FPKM all transcripts ) ) * 10^6

TPM(i) = ( RPKM(i) / sum ( RPKM all genes ) ) * 10^6

To convert FPKM to TPM, first generate dummy FPKM data:

num_genes <- 1000
num_samples <- 5

fpkm_matrix <- matrix(rexp(num_genes * num_samples, rate = 0.1), nrow = num_genes)
colnames(fpkm_matrix) <- paste0("Sample_", 1:num_samples)
rownames(fpkm_matrix) <- paste0("Gene_", 1:num_genes)


Then compute TPM using the formula above:

sum_fpkm_per_sample <- colSums(fpkm_matrix)
scaling_factors <- sum_fpkm_per_sample / 1e6        # per-sample total FPKM, in millions
tpm_matrix <- t(t(fpkm_matrix) / scaling_factors)   # dividing by the scaling factor already yields TPM
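
A quick sanity check for any FPKM-to-TPM conversion is that each sample's TPM values sum to one million. The sketch below wraps the formula above in a small helper and checks that property on simulated data. The function name `fpkm_to_tpm_matrix` and the simulated values are illustrative assumptions, not part of the original answer.

```r
# Hypothetical helper: convert a genes x samples FPKM matrix to TPM.
fpkm_to_tpm_matrix <- function(fpkm) {
  # Divide every column by its own sum, then scale to one million.
  sweep(fpkm, 2, colSums(fpkm), "/") * 1e6
}

set.seed(1)  # simulated data, for illustration only
fpkm_demo <- matrix(rexp(20 * 3, rate = 0.1), nrow = 20,
                    dimnames = list(paste0("Gene_", 1:20),
                                    paste0("Sample_", 1:3)))

tpm_demo <- fpkm_to_tpm_matrix(fpkm_demo)

# Every column should sum to 1e6 (up to floating-point error).
stopifnot(isTRUE(all.equal(unname(colSums(tpm_demo)), rep(1e6, 3))))
```

If the column sums come out as anything other than 1e6 (for example 1e12), the scaling has been applied twice.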

1

I apologize for my previous comment - the code looked really similar to the one generated by ChatGPT and your history of using ChatGPT triggered a suspicion. I'll delete my other comment.

0

Is the function really working as expected?

The output for the gene_expression dataset is:

print(tpm_data)
Gene    Sample1  Sample2  Sample3
1 Gene1 166666.7 250000.0 133333.3
2 Gene2 266666.7 333333.3 240000.0
3 Gene3 555555.6 648148.1 518518.5


If we had sequenced only two samples, the output would be different:

print(tpm_data2)
Gene    Sample1  Sample2
1 Gene1 166666.7 200000.0
2 Gene2 266666.7 416666.7
3 Gene3 500000.0 466666.7


Here is a modification, based on your function, that yields the same result in both cases.

library(dplyr)
library(tidyr)

fpkm_to_tpm <- function(fpkm_dat){
  id_col <- names(fpkm_dat)[1]   # first column holds the gene identifiers

  fpkm_dat %>%
    pivot_longer(-all_of(id_col), names_to = "sample", values_to = "fpkm") %>%
    group_by(sample) %>%         # one group per sample column
    mutate(total_fpkm_per_sample = sum(fpkm),              # sum of FPKM values per sample
           scaling_factor = total_fpkm_per_sample / 1e6,   # scaling factor per sample
           tpm_values = fpkm / scaling_factor) %>%         # calculate TPM values
    ungroup() %>%
    select(all_of(id_col), sample, tpm_values) %>%
    pivot_wider(names_from = "sample", values_from = tpm_values)
}


Function called on the gene_expression dataset:

fpkm_to_tpm(gene_expression)
# A tibble: 3 × 4
Gene   Sample1 Sample2 Sample3
<chr>    <dbl>   <dbl>   <dbl>
1 Gene1 166667. 200000  148148.
2 Gene2 333333. 333333. 333333.
3 Gene3 500000  466667. 518519.


Function called on a subset of the gene_expression dataset:

fpkm_to_tpm(gene_expression %>% select(1:3))
# A tibble: 3 × 3
Gene  Sample1 Sample2
<chr>   <dbl>   <dbl>
1 Gene1 166667. 200000
2 Gene2 333333. 333333.
3 Gene3 500000  466667.
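
The reason the two calls agree is that TPM is computed within each sample, so dropping a column cannot change the remaining ones. Here is a minimal base-R sketch of that check, using made-up FPKM values (the `toy_fpkm` matrix and `to_tpm` helper are hypothetical, not the `gene_expression` data from this thread):

```r
# Hypothetical toy FPKM matrix (values invented for illustration).
toy_fpkm <- matrix(c(10, 20, 30,
                      5, 15, 40,
                      8,  8,  8),
                   nrow = 3,
                   dimnames = list(paste0("Gene", 1:3),
                                   paste0("Sample", 1:3)))

# Per-sample conversion: each column is scaled by its own total.
to_tpm <- function(fpkm) t(t(fpkm) / colSums(fpkm)) * 1e6

tpm_all    <- to_tpm(toy_fpkm)
tpm_subset <- to_tpm(toy_fpkm[, 1:2])

# Sample1 and Sample2 are unchanged when Sample3 is dropped.
stopifnot(isTRUE(all.equal(tpm_all[, 1:2], tpm_subset)))
```

If a conversion gives different values for the same sample depending on which other samples are in the table, the normalization is being computed across samples rather than within them.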

1
18 months ago

I think your TPM from FPKM calculation is correct. See the section "Relationship between TPM and FPKM" in this helpful blog post by Harold Pimentel, which draws on a manuscript by his PhD advisor, Lior Pachter.
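
For readers who want the reasoning without following the link, here is a short sketch of why the conversion works. The notation is introduced here for illustration and is not taken from the blog post: $c_i$ is the number of fragments assigned to transcript $i$, $\ell_i$ is its length in bases (ignoring any effective-length correction), and $N$ is the total number of mapped fragments in the sample.

```latex
\begin{align*}
\mathrm{FPKM}_i &= \frac{10^{9}\, c_i}{\ell_i\, N},
&
\mathrm{TPM}_i &= 10^{6} \cdot \frac{c_i / \ell_i}{\sum_j c_j / \ell_j}
\\[6pt]
\frac{\mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j}
&= \frac{\dfrac{10^{9}}{N} \cdot \dfrac{c_i}{\ell_i}}{\dfrac{10^{9}}{N} \cdot \sum_j \dfrac{c_j}{\ell_j}}
 = \frac{c_i / \ell_i}{\sum_j c_j / \ell_j}
&
\Longrightarrow\quad
\mathrm{TPM}_i &= 10^{6} \cdot \frac{\mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j}
\end{align*}
```

The factor $10^{9}/N$ is the same for every transcript in a sample, so it cancels, which is why neither the raw counts nor the library size are needed for the conversion.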

3

Great! Thank you! If anyone in the future needs a solution for this question, here is the code to do this:

library(tidyverse)
fpkm_data %>% mutate(across(where(is.numeric), ~ .x / sum(.x) * 1e6))

0
7 weeks ago
ajay nair ▴ 50

FPKM can be converted to TPM, and the approach you found is correct (it is also the same as the RPKM-to-TPM conversion). FPKM and RPKM are conceptually the same normalization; they differ only in that FPKM is applied to paired-end and RPKM to single-end RNA-seq data. If you notice, your function to convert RPKM to TPM

TPM <- apply(RPKM, 2, function(x) x / sum(as.numeric(x)) * 10^6)


is doing the same thing as the approach you mention for FPKM

TPM = FPKM/[sum of all FPKM of a sample]*10^6
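
To see that the intermediate RPKM/FPKM step does not matter, the following sketch computes TPM from simulated counts in two ways, directly and via FPKM, and checks that the results agree. The counts, gene lengths, and variable names are invented for illustration.

```r
set.seed(42)  # simulated inputs, for illustration only
n_genes  <- 50
counts   <- matrix(rpois(n_genes * 2, lambda = 100), nrow = n_genes)
gene_len <- sample(500:5000, n_genes)   # gene lengths in bases

# Route 1: counts -> FPKM -> TPM
fpkm    <- t(t(counts / gene_len) / colSums(counts)) * 1e9
tpm_via <- t(t(fpkm) / colSums(fpkm)) * 1e6

# Route 2: counts -> TPM directly (length-normalize, then rescale)
rate       <- counts / gene_len
tpm_direct <- t(t(rate) / colSums(rate)) * 1e6

# Both routes give the same TPM values.
stopifnot(isTRUE(all.equal(tpm_via, tpm_direct)))
```

This only demonstrates the arithmetic. It says nothing about whether TPM values from different studies are comparable, which also depends on how each dataset was aligned and quantified.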


I have made a few modifications to your own code and functions above to make the similarities clearer.