Hello!
I have a table of counts that I got by aligning rna seq samples with STAR and using featureCounts, and my goal is to get TPM values for each gene of the table.
As a first step, I imported my table into R and modified it a bit to make it more understandable for me.
# load required packages
library("pacman")
p_load("vroom", "tidyverse")
# load data
GSE_fc <- vroom("/data/GSE_fc-table.txt",
skip = 1) %>%
select(1, 6:10) %>%
rename("SRR0" = 3,
"SRR1" = 4,
"SRR2" = 5,
"SRR3" = 6)
The table of counts looks like this:
> head(GSE_fc)
# A tibble: 6 × 6
Geneid Length SRR0 SRR1 SRR2 SRR3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000160072 7686 1790 537 1366 740
2 ENSG00000279928 570 0 0 0 0
3 ENSG00000228037 626 0 0 0 0
4 ENSG00000142611 9767 2 0 1 0
5 ENSG00000284616 1225 0 0 0 0
6 ENSG00000157911 3782 988 429 1028 444
I used as a reference the following links that indicate how to obtain TPM values using the Length column and the columns containing the count values from the count table:
https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
# RPKM versus TPM | |
# | |
# RPKM and TPM are both normalized for library size and gene length. | |
# | |
# RPKM is not comparable across different samples. | |
# | |
# For more details, see: http://blog.nextgenetics.net/?e=51 | |
rpkm <- function(counts, lengths) { | |
rate <- counts / lengths | |
rate / sum(counts) * 1e6 | |
} | |
tpm <- function(counts, lengths) { | |
rate <- counts / lengths | |
rate / sum(rate) * 1e6 | |
} | |
genes <- data.frame( | |
Gene = c("A","B","C","D","E"), | |
Length = c(100, 50, 25, 5, 1) | |
) | |
counts <- data.frame( | |
S1 = c(80, 10, 6, 3, 1), | |
S2 = c(20, 20, 10, 50, 400) | |
) | |
rpkms <- apply(counts, 2, function(x) rpkm(x, genes$Length)) | |
tpms <- apply(counts, 2, function(x) tpm(x, genes$Length)) | |
genes | |
# Gene Length | |
# 1 A 100 | |
# 2 B 50 | |
# 3 C 25 | |
# 4 D 5 | |
# 5 E 1 | |
counts | |
# S1 S2 | |
# 1 80 20 | |
# 2 10 20 | |
# 3 6 10 | |
# 4 3 50 | |
# 5 1 400 | |
rpkms | |
# S1 S2 | |
# [1,] 8000 4e+02 | |
# [2,] 2000 8e+02 | |
# [3,] 2400 8e+02 | |
# [4,] 6000 2e+04 | |
# [5,] 10000 8e+05 | |
tpms | |
# S1 S2 | |
# [1,] 281690.14 486.618 | |
# [2,] 70422.54 973.236 | |
# [3,] 84507.04 973.236 | |
# [4,] 211267.61 24330.900 | |
# [5,] 352112.68 973236.010 | |
# Sample means should be equal. | |
colSums(rpkms) | |
# S1 S2 | |
# 28400 822000 | |
colSums(tpms) | |
# S1 S2 | |
# 1e+06 1e+06 | |
colMeans(rpkms) | |
# S1 S2 | |
# 5680 164400 | |
colMeans(tpms) | |
# S1 S2 | |
# 2e+05 2e+05 |
I adapted pieces of code to calculate TPM values and transformed it to dplyr and tidyr syntax (R packages). It looks like this:
# function to calculate tpm values
tpm <- function(counts, lengths) {
rate <- counts / lengths
rate / sum(rate) * 1e6
}
# code to obtain tpm values from a dataframe
GSE128263_fc.tpm <- GSE128263_fc %>%
pivot_longer(starts_with("SRR"),
names_to = "sample",
values_to = "counts") %>%
group_by(sample) %>%
mutate(tpm=tpm(counts, Length)) %>%
pivot_wider(names_from = sample,
values_from = c(4:5))
However, reading other similar posts that also calculate TPM,
Calculating TPM from featureCounts output
#' Convert counts to transcripts per million (TPM). | |
#' | |
#' Convert a numeric matrix of features (rows) and conditions (columns) with | |
#' raw feature counts to transcripts per million. | |
#' | |
#' Lior Pachter. Models for transcript quantification from RNA-Seq. | |
#' arXiv:1104.3889v2 | |
#' | |
#' Wagner, et al. Measurement of mRNA abundance using RNA-seq data: | |
#' RPKM measure is inconsistent among samples. Theory Biosci. 24 July 2012. | |
#' doi:10.1007/s12064-012-0162-3 | |
#' | |
#' @param counts A numeric matrix of raw feature counts i.e. | |
#' fragments assigned to each gene. | |
#' @param featureLength A numeric vector with feature lengths. | |
#' @param meanFragmentLength A numeric vector with mean fragment lengths. | |
#' @return tpm A numeric matrix normalized by library size and feature length. | |
counts_to_tpm <- function(counts, featureLength, meanFragmentLength) { | |
# Ensure valid arguments. | |
stopifnot(length(featureLength) == nrow(counts)) | |
stopifnot(length(meanFragmentLength) == ncol(counts)) | |
# Compute effective lengths of features in each library. | |
effLen <- do.call(cbind, lapply(1:ncol(counts), function(i) { | |
featureLength - meanFragmentLength[i] + 1 | |
})) | |
# Exclude genes with length less than the mean fragment length. | |
idx <- apply(effLen, 1, function(x) min(x) > 1) | |
counts <- counts[idx,] | |
effLen <- effLen[idx,] | |
featureLength <- featureLength[idx] | |
# Process one column at a time. | |
tpm <- do.call(cbind, lapply(1:ncol(counts), function(i) { | |
rate = log(counts[,i]) - log(effLen[,i]) | |
denom = log(sum(exp(rate))) | |
exp(rate - denom + log(1e6)) | |
})) | |
# Copy the row and column names from the original matrix. | |
colnames(tpm) <- colnames(counts) | |
rownames(tpm) <- rownames(counts) | |
return(tpm) | |
} |
I see that they do the calculation in a different way and my specific question is to know if what I did, and wrote in this post, is right.
My table at the end of the calculation of the TPM values looks like this
> head(GSE128263_fc.tpm)
# A tibble: 6 × 10
Geneid Length count…¹ count…² count…³ count…⁴ tpm_S…⁵ tpm_S…⁶ tpm_S…⁷ tpm_S…⁸
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG000… 7686 1790 537 1366 740 35.6 33.0 23.9 25.0
2 ENSG000… 570 0 0 0 0 0 0 0 0
3 ENSG000… 626 0 0 0 0 0 0 0 0
4 ENSG000… 9767 2 0 1 0 0.0313 0 0.0138 0
5 ENSG000… 1225 0 0 0 0 0 0 0 0
6 ENSG000… 3782 988 429 1028 444 40.0 53.6 36.6 30.4
Thank you for your attention and support in advance
You can also give the transcript alignments from STAR directly to salmon which will give you properly computed TPMs (and will do so faster than RSEM).
Gene lengths values from featureCounts do not include introns.
But it's still not the right value. The overlap of all the exons doesn't represent the real length of anything, "correcting" with it doesn't solve anything.
I did not say it was the right value for TPMs. featureCounts is not designed to compute TPMs anyway. I am simply asking you to edit your answer to remove the misinformation about introns.
BTW, if you look back through past threads, you will see that OP has already run Salmon.
Thank you all for your answers and suggestions. I will re-run the alignments with STAR and get the counts and TPM values with salmon. :)