Obtaining TPM values from STAR alignment and counts with featurecounts using R's tidyverse syntax (dplyr and tidyr)
1
0
Entering edit mode
2.2 years ago
Josh ▴ 20

Hello!

I have a table of counts that I got by aligning rna seq samples with STAR and using featureCounts, and my goal is to get TPM values for each gene of the table.

As a first step, I imported my table into R and modified it a bit to make it more understandable for me.

# load required packages
library("pacman")

p_load("vroom", "tidyverse")


# load data
GSE_fc <- vroom("/data/GSE_fc-table.txt", 
                      skip = 1) %>%
  select(1, 6:10) %>%
  rename("SRR0" = 3,
         "SRR1" = 4,
         "SRR2" = 5,
         "SRR3" = 6)

The table of counts looks like this:

 > head(GSE_fc)
# A tibble: 6 × 6
  Geneid          Length      SRR0       SRR1       SRR2       SRR3
  <chr>            <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
1 ENSG00000160072   7686       1790        537       1366        740
2 ENSG00000279928    570          0          0          0          0
3 ENSG00000228037    626          0          0          0          0
4 ENSG00000142611   9767          2          0          1          0
5 ENSG00000284616   1225          0          0          0          0
6 ENSG00000157911   3782        988        429       1028        444

I used as a reference the following links that indicate how to obtain TPM values using the Length column and the columns containing the count values from the count table:

http://ny-shao.name/2016/11/18/a-short-script-to-calculate-rpkm-and-tpm-from-featurecounts-output.html

https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

Raw counts to TPM in R

# RPKM versus TPM
#
# RPKM and TPM are both normalized for library size and gene length.
#
# RPKM is not comparable across different samples.
#
# For more details, see: http://blog.nextgenetics.net/?e=51
rpkm <- function(counts, lengths) {
rate <- counts / lengths
rate / sum(counts) * 1e6
}
tpm <- function(counts, lengths) {
rate <- counts / lengths
rate / sum(rate) * 1e6
}
genes <- data.frame(
Gene = c("A","B","C","D","E"),
Length = c(100, 50, 25, 5, 1)
)
counts <- data.frame(
S1 = c(80, 10, 6, 3, 1),
S2 = c(20, 20, 10, 50, 400)
)
rpkms <- apply(counts, 2, function(x) rpkm(x, genes$Length))
tpms <- apply(counts, 2, function(x) tpm(x, genes$Length))
genes
# Gene Length
# 1 A 100
# 2 B 50
# 3 C 25
# 4 D 5
# 5 E 1
counts
# S1 S2
# 1 80 20
# 2 10 20
# 3 6 10
# 4 3 50
# 5 1 400
rpkms
# S1 S2
# [1,] 8000 4e+02
# [2,] 2000 8e+02
# [3,] 2400 8e+02
# [4,] 6000 2e+04
# [5,] 10000 8e+05
tpms
# S1 S2
# [1,] 281690.14 486.618
# [2,] 70422.54 973.236
# [3,] 84507.04 973.236
# [4,] 211267.61 24330.900
# [5,] 352112.68 973236.010
# Sample means should be equal.
colSums(rpkms)
# S1 S2
# 28400 822000
colSums(tpms)
# S1 S2
# 1e+06 1e+06
colMeans(rpkms)
# S1 S2
# 5680 164400
colMeans(tpms)
# S1 S2
# 2e+05 2e+05

I adapted pieces of code to calculate TPM values and transformed it to dplyr and tidyr syntax (R packages). It looks like this:

# function to calculate tpm values 
tpm <- function(counts, lengths) {
  rate <- counts / lengths
  rate / sum(rate) * 1e6
}

# code to obtain tpm values from a dataframe
GSE128263_fc.tpm <- GSE128263_fc %>% 
  pivot_longer(starts_with("SRR"), 
               names_to = "sample", 
               values_to = "counts") %>% 
  group_by(sample) %>% 
  mutate(tpm=tpm(counts, Length)) %>% 
  pivot_wider(names_from = sample,
              values_from = c(4:5))

However, reading other similar posts that also calculate TPM,

Calculating TPM from featureCounts output

#' Convert counts to transcripts per million (TPM).
#'
#' Convert a numeric matrix of features (rows) and conditions (columns) with
#' raw feature counts to transcripts per million.
#'
#' Lior Pachter. Models for transcript quantification from RNA-Seq.
#' arXiv:1104.3889v2
#'
#' Wagner, et al. Measurement of mRNA abundance using RNA-seq data:
#' RPKM measure is inconsistent among samples. Theory Biosci. 24 July 2012.
#' doi:10.1007/s12064-012-0162-3
#'
#' @param counts A numeric matrix of raw feature counts i.e.
#' fragments assigned to each gene.
#' @param featureLength A numeric vector with feature lengths.
#' @param meanFragmentLength A numeric vector with mean fragment lengths.
#' @return tpm A numeric matrix normalized by library size and feature length.
counts_to_tpm <- function(counts, featureLength, meanFragmentLength) {
# Ensure valid arguments.
stopifnot(length(featureLength) == nrow(counts))
stopifnot(length(meanFragmentLength) == ncol(counts))
# Compute effective lengths of features in each library.
effLen <- do.call(cbind, lapply(1:ncol(counts), function(i) {
featureLength - meanFragmentLength[i] + 1
}))
# Exclude genes with length less than the mean fragment length.
idx <- apply(effLen, 1, function(x) min(x) > 1)
counts <- counts[idx,]
effLen <- effLen[idx,]
featureLength <- featureLength[idx]
# Process one column at a time.
tpm <- do.call(cbind, lapply(1:ncol(counts), function(i) {
rate = log(counts[,i]) - log(effLen[,i])
denom = log(sum(exp(rate)))
exp(rate - denom + log(1e6))
}))
# Copy the row and column names from the original matrix.
colnames(tpm) <- colnames(counts)
rownames(tpm) <- rownames(counts)
return(tpm)
}
view raw counts_to_tpm.R hosted with ❤ by GitHub

I see that they do the calculation in a different way and my specific question is to know if what I did, and wrote in this post, is right.

My table at the end of the calculation of the TPM values looks like this

> head(GSE128263_fc.tpm)
# A tibble: 6 × 10
  Geneid   Length count…¹ count…² count…³ count…⁴ tpm_S…⁵ tpm_S…⁶ tpm_S…⁷ tpm_S…⁸
  <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 ENSG000…   7686    1790     537    1366     740 35.6       33.0 23.9       25.0
2 ENSG000…    570       0       0       0       0  0          0    0          0  
3 ENSG000…    626       0       0       0       0  0          0    0          0  
4 ENSG000…   9767       2       0       1       0  0.0313     0    0.0138     0  
5 ENSG000…   1225       0       0       0       0  0          0    0          0  
6 ENSG000…   3782     988     429    1028     444 40.0       53.6 36.6       30.4

Thank you for your attention and support in advance

R rnaseq tidyverse tpm • 3.8k views
ADD COMMENT
2
Entering edit mode
2.2 years ago

Hang on. You want to use transcript lengths, not gene lengths. How can you think that the lengths of introns is relevant here? And since each gene has many transcripts of different lengths, you would need to know the relative abundance of all those transcripts, which STAR can't do for you.

Just start over with Kallisto, it will generate TPM calculated properly. Or, if you have the transcript alignment from STAR, RSEM can use that input to give you TPM.

ADD COMMENT
2
Entering edit mode

You can also give the transcript alignments from STAR directly to salmon which will give you properly computed TPMs (and will do so faster than RSEM).

ADD REPLY
1
Entering edit mode

Gene lengths values from featureCounts do not include introns.

ADD REPLY
0
Entering edit mode

But it's still not the right value. The overlap of all the exons doesn't represent the real length of anything, "correcting" with it doesn't solve anything.

ADD REPLY
0
Entering edit mode

I did not say it was the right value for TPMs. featureCounts is not designed to compute TPMs anyway. I am simply asking you to edit your answer to remove the misinformation about introns.

BTW, if you look back through past threads, you will see that OP has already run Salmon.

ADD REPLY
1
Entering edit mode

Thank you all for your answers and suggestions. I will re-run the alignments with STAR and get the counts and TPM values with salmon. :)

ADD REPLY

Login before adding your answer.

Traffic: 2205 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6