Question

How to get transcript lengths from RNA-seq data to calculate TPM

3.5 years ago

MatStat ▴ 160

Hi all,

I have raw counts from bulk RNA-seq data (Ensembl-annotated). I am trying to calculate TPM, and I understand mathematically how to do it.

What I don’t understand is how to obtain the transcript lengths. I found a function in the GenomicFeatures package that extracts them in R: https://www.rdocumentation.org/packages/GenomicFeatures/versions/1.24.4/topics/transcriptLengths

They mention that:

The length of a processed transcript is just the sum of the lengths of its exons. This should not be confounded with the length of the stretch of DNA transcribed into RNA (a.k.a. transcription unit), which can be obtained with width(transcripts(txdb)).

When I apply that method I get duplicated gene IDs, because each gene has one row per transcript. How should I solve that issue?

Should I use this column:

tx_len: The length of the processed transcript.

And then collapse it per gene ID?

Is there a simple way to approach this issue?

All the best

RNA-Seq TPM genomicfeatures bioconductor • 3.9k views

updated 3.5 years ago by ATpoint 81k • written 3.5 years ago by MatStat ▴ 160
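One way to deal with the duplicated gene IDs is to collapse the per-transcript lengths to a single value per gene. Below is a minimal base R sketch, assuming a data frame shaped like the output of transcriptLengths() (columns tx_name, gene_id, tx_len). The IDs and lengths are made up, and taking the longest or the median transcript is only one possible convention, not the only valid choice.

```r
# Toy table mimicking transcriptLengths(txdb) output; all values are invented.
tx <- data.frame(
  tx_name = c("tx1", "tx2", "tx3", "tx4", "tx5"),
  gene_id = c("geneA", "geneA", "geneA", "geneB", "geneB"),
  tx_len  = c(1200, 900, 1500, 2000, 2400),
  stringsAsFactors = FALSE
)

# Collapse to one length per gene.
# Option 1: length of the longest transcript of each gene.
gene_len_max <- tapply(tx$tx_len, tx$gene_id, max)
# Option 2: median transcript length of each gene.
gene_len_median <- tapply(tx$tx_len, tx$gene_id, median)

print(gene_len_max)
print(gene_len_median)
```

Whichever convention is used, the resulting vector has one entry per gene ID and can be matched to the rows of the count matrix by name.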

May I ask up front what you plan to use TPM for? It was actually developed to compare transcript expression within the same sample.

3.5 years ago by ATpoint 81k

Sure. Some deconvolution methods require data that has not been log-transformed, and they suggest TPM as input.

Best

3.5 years ago by MatStat ▴ 160

Please have a look at this post

3.5 years ago by Irsan ★ 7.8k

Update: the error was resolved after starting a new R session.

Hi,

Thank you for the prompt reply; it seems that this could solve the issue. However, I'm getting an error on this line:

exonic.gene.sizes <- as.data.frame(sum(width(reduce(exons.list.per.gene))))

Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': error in evaluating the argument 'x' in selecting a method for function 'width': argument ".f" is missing, with no default

Did you encounter this?

Best

3.5 years ago by MatStat ▴ 160
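An error about a missing ".f" argument is what one would expect if reduce() resolved to purrr::reduce() (which takes .x and .f) instead of the GenomicRanges method, for example after loading purrr or the tidyverse. That would also fit the error disappearing in a fresh session. Below is a sketch under that assumption, with a small hand-made GRangesList standing in for exons.list.per.gene (coordinates are invented); qualifying the call with its namespace avoids the masking.

```r
library(GenomicRanges)

# Toy stand-in for exons.list.per.gene: exons grouped by gene (made-up coordinates).
exons.list.per.gene <- GRangesList(
  geneA = GRanges("chr1", IRanges(start = c(1, 50, 200), end = c(100, 150, 300))),
  geneB = GRanges("chr2", IRanges(start = c(10, 500), end = c(20, 600)))
)

# Merge overlapping exons within each gene, then sum the merged widths.
# The explicit GenomicRanges:: prefix makes sure purrr::reduce() is not picked up.
merged.exons <- GenomicRanges::reduce(exons.list.per.gene)
exonic.gene.sizes <- sum(width(merged.exons))

# One row per gene: gene ID and non-overlapping exonic length in bp.
exonic.gene.sizes.df <- data.frame(
  gene_id = names(exonic.gene.sizes),
  length_bp = as.integer(exonic.gene.sizes)
)
print(exonic.gene.sizes.df)
```

Because overlapping exons are merged first, each base is counted only once per gene, so there is one length per gene ID rather than one per transcript.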

Do the raw counts correspond to gene-level or transcript-level expression estimates?

3.5 years ago by h.mon 35k

Hi h.mon,

The raw counts are reads mapped to genes; they are integers because they have not been normalized.

3.5 years ago by MatStat ▴ 160

score 3 · Answer 1 · 2020-10-15

Standard FPKM, RPKM and TPM have the problem that they account only for sequencing depth (and gene length), not for compositional bias. I personally prefer more sophisticated methods that actually correct for composition, e.g. cpm from edgeR:

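library(edgeR)
# Toy count matrix: 10,000 genes x 4 samples (replace with your raw counts)
cts <- sapply(seq_len(4), function(x) rpois(10000, lambda = 100))
y <- DGEList(counts = cts)
# TMM normalization factors correct for composition bias
y <- calcNormFactors(y, method = "TMM") # see ?calcNormFactors for other methods
# CPM based on the TMM-scaled (effective) library sizes
edgeR.cpm <- cpm(y, log = FALSE)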

If I want to correct for gene length, I either divide edgeR.cpm by the gene length in kb or use edgeR::rpkm(). The latter does pretty much the same as cpm(), i.e. it corrects for sequencing depth and composition, and additionally divides by the gene length, which you have to provide.
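For completeness, here is a minimal base R sketch of the plain TPM calculation asked about in the question, given one length per gene. The counts and lengths are invented, the object names are only illustrative, and this version does not apply any composition correction such as TMM.

```r
# Toy inputs (made up): raw counts for 3 genes x 2 samples, gene lengths in bp.
counts <- matrix(
  c(100, 200, 300,
     50, 400, 150),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2"))
)
gene_length_bp <- c(geneA = 1000, geneB = 2000, geneC = 3000)

# Make sure lengths are in the same order as the rows of the count matrix.
gene_length_bp <- gene_length_bp[rownames(counts)]

# Step 1: reads per kilobase (each row divided by its gene length in kb).
rpk <- counts / (gene_length_bp / 1000)

# Step 2: scale every sample (column) so that its values sum to one million.
tpm <- t(t(rpk) / colSums(rpk)) * 1e6

print(round(tpm, 1))
print(colSums(tpm))
```

Each column of tpm sums to one million by construction, which is why TPM values are not log-transformed and are often requested as input by deconvolution tools.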