Why counts to tpm gives different values compared to tpm from kallisto?
0
0
Entering edit mode
7 months ago
newbie ▴ 90

I have RNA-Seq data with raw counts extracted from bam files using featureCounts. I wanted to convert counts to tpm, because I need TPM for some analysis.

Initially, I got the coding length for each Gene like below:

library(GenomicFeatures)
gtf_txdb <- makeTxDbFromGFF("gencode.v27.annotation.gtf")
# then collect the exons per gene id
exons.list.per.gene <- exonsBy(gtf_txdb,by="gene")
# then for each gene, reduce all the exons to a set of non overlapping exons, calculate their lengths (widths) and sum then
exonic.gene.sizes <- sum(width(reduce(exons.list.per.gene)))
exonic.gene.sizes <- as.data.frame(sum(width(reduce(exons.list.per.gene))))


Now, I have two data frames like data is with raw counts and other data2 with gene name and Coding_Length

tpm <- function(counts, lengths) {
rate <- counts / lengths
rate / sum(rate, na.rm = TRUE) * 1e6
}

final_tpm <- apply(data, 2, function(x) tpm(x, data2$Coding_Length))  final_tpm using the function for some of the genes: gene tpm RPSAP8 3.79 AL645608.8 4.14 RNF223 1.38  And I also used other function like below on dataframe df which have two columns raw counts and effLength calculated from Coding_Length df$effLength <- df$Coding_Length - 203.7 + 1 countToTpm <- function(counts, effLen) { rate <- log(counts) - log(effLen) denom <- log(sum(exp(rate))) exp(rate - denom + log(1e6)) } df$tpm <- with(df, countToTpm(counts, effLength))


With the above way the output of few genes is like below:

gene_name       tpm
RPSAP8          1.57
AL645608.8      1.46
RNF223          0.496


Along with the above way, I also used kallisto to get tpm. And when I compared both results, they doesn't match and not even close.

target_id          length      eff_length        est_counts     tpm
RPSAP8             889          747.358           4.10538       0.0635304
AL645608.8        2086          1944.36              116         0.689984
RNF223            1902          1760.36           50.0024        0.328508


I actually have 300 samples for which I have raw counts. I wanted to convert them to tpm and using the above functions but they give different results compared to tpm from kallisto.

Which of these functions is to be used for conversion of counts to tpm? Why they don't match to kallisto tpm?

RNA-Seq tpm kallisto rawcounts featureCounts • 493 views
0
Entering edit mode

When compared to TPMs estimated from featureCounts output, the TPM from kallisto should be more correct. As for the reason why, see the answer from Michael Love for the question DESeq2: Is it possible to convert read counts to expression values via TPM and return these values?

I didn't check your functions that calculate TPM, but at least one of them is wrong: if they are using the same input data, and calculating the same value, they should yield the same result. Did you sanity check your TPMs? For example, to they sum to one million?

0
Entering edit mode

If you check the functions, for one of the function only Coding_length is used and for other function effLength which is calculated from Coding_Length is used. And yes, I did the sanity check, the results from both functions give sum to one million. Among the two functions I used one is suggested by Michael Love. Please check it.

And Yes, I can understand that kallisto tpm is better, But I have the raw counts and instead of running kallisto again on all 300 samples, I wanted to convert raw counts to tpm.

0
Entering edit mode

kallisto (or Salmon) would run these 300 samples really quickly, so if consider important get better estimates of TPM, you should run them.

df$effLength <- df$Coding_Length - 203.7 + 1


What is the reasoning behind this effective length? As you are not using the same gene length in both calculation, the TPMs will differ.

0
Entering edit mode

I saw the effective length thing here A review of RNA-Seq expression units. Yes I feel like running kallisto now. One last question. Once I get the transcripts with tpm, can I sum the tpm of transcripts from same gene to make it gene level transcripts?