I have raw counts from bulk RNA-seq data (Ensembl-annotated) and I am trying to calculate TPM. I understand the calculation mathematically.
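For reference, this is the formula I have in mind, where $c_g$ is the raw count for gene $g$ and $\ell_g$ is its length in kilobases (the symbols are my own notation, not taken from any package):

$$
\mathrm{TPM}_g = \frac{c_g / \ell_g}{\sum_j c_j / \ell_j} \times 10^6
$$

So the only missing ingredient is a single length $\ell_g$ per gene.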
What I don’t understand is how to obtain the transcript lengths. I found an R function that extracts them, transcriptLengths from the GenomicFeatures package: https://www.rdocumentation.org/packages/GenomicFeatures/versions/1.24.4/topics/transcriptLengths
The documentation mentions that:
The length of a processed transcript is just the sum of the lengths of its exons. This should not be confounded with the length of the stretch of DNA transcribed into RNA (a.k.a. transcription unit), which can be obtained with width(transcripts(txdb)).
When I apply that method I get duplicated gene IDs, because each gene has a separate row for each of its transcripts. How should I solve that issue?
Should I sum the tx_len values of all transcripts belonging to a gene, where the documentation defines
tx_len: The length of the processed transcript.
and then collapse the gene IDs?
Is there a simple way to approach this issue?
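To make the question concrete, here is a minimal sketch of an alternative I am considering. It is written in Python with toy data, purely to show the arithmetic. All IDs, coordinates, and function names are made up for illustration and are not part of GenomicFeatures. Summing tx_len across transcripts would count shared exons more than once, so the sketch instead merges overlapping exons within each gene, sums the merged widths, and uses that as the gene length for TPM. Whether this is the right definition of gene length is part of what I am asking.

```python
from collections import defaultdict

# Toy exon table (hypothetical IDs; coordinates are 1-based, inclusive):
# (gene_id, tx_id, exon_start, exon_end)
exons = [
    ("geneA", "tx1", 100, 199),    # 100 bp
    ("geneA", "tx1", 300, 399),    # 100 bp
    ("geneA", "tx2", 150, 249),    # 100 bp, overlaps tx1's first exon
    ("geneA", "tx2", 300, 399),    # 100 bp, same exon as in tx1
    ("geneB", "tx3", 1000, 1499),  # 500 bp
]

def union_exon_length(exons):
    """Per gene: merge overlapping exons across transcripts, sum merged widths."""
    by_gene = defaultdict(list)
    for gene_id, _tx_id, start, end in exons:
        by_gene[gene_id].append((start, end))

    lengths = {}
    for gene_id, intervals in by_gene.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:      # overlapping or directly adjacent
                cur_end = max(cur_end, end)
            else:                         # gap: close the current merged block
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1  # close the last merged block
        lengths[gene_id] = total
    return lengths

def tpm(counts, lengths_bp):
    """TPM from raw counts and gene lengths given in base pairs."""
    rate = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    scale = sum(rate.values())
    return {g: r / scale * 1e6 for g, r in rate.items()}

gene_len = union_exon_length(exons)   # geneA: 250 bp, geneB: 500 bp
counts = {"geneA": 50, "geneB": 100}  # toy raw counts
gene_tpm = tpm(counts, gene_len)
```

In this toy case geneA gets 250 bp from the merged exons, whereas summing tx_len over tx1 and tx2 would give 400 bp and taking the longest transcript would give 200 bp, so the choice clearly changes the resulting TPM.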
All the best