Estimating RPKM or TPM in RNA-Seq data
7.0 years ago
darxsys ▴ 210

I am trying to test a piece of software for abundance estimation. I would like to generate my own set of reads in a way that lets me know the expected relative abundances in advance, so that I can compare the software's output to that benchmark. If I have a set of N transcripts and I generate M reads from them, knowing the origin of each read, can I use that information to compute the expected RPKM or TPM, and how? Would TPM for a specific transcript just be num_reads_from_it / num_reads_overall * 10^6?

RNA-Seq rpkm • 2.9k views
7.0 years ago
Rob 5.3k

If you know the number of reads originating from each transcript t (call it n_t), then you can compute TPM_t = 10^6 * (n_t / l_t) / sum_{t'} (n_t' / l_t'), where l_t is the length of transcript t and the sum runs over all transcripts t'. Note that this is different from the formula you have above. That one computes NPM (nucleotides per million), which is a measure of relative abundance that is *not* normalized for length. Also, I'd avoid FPKM / RPKM completely: there's no benefit relative to TPM, and there are some shortcomings (though it shouldn't really matter when assessing accuracy on simulated data in a single sample).
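To make the TPM formula concrete, here is a minimal Python sketch. It assumes you already have the true read count and the length of every transcript from your simulation. The function name `tpm_from_counts` and the toy transcript names and numbers are made up for illustration.

```python
def tpm_from_counts(counts, lengths):
    """Return TPM per transcript given true read counts and transcript lengths."""
    # Reads per base for each transcript: n_t / l_t
    rates = {t: counts[t] / lengths[t] for t in counts}
    # Denominator of the formula: sum over all transcripts of n_t' / l_t'
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no reads assigned to any transcript")
    # Scale so that the values sum to one million
    return {t: 1e6 * rate / total for t, rate in rates.items()}


# Hypothetical toy data: two transcripts with equal read counts but different lengths
counts = {"tx_a": 100, "tx_b": 100, "tx_c": 0}
lengths = {"tx_a": 1000, "tx_b": 2000, "tx_c": 500}

tpm = tpm_from_counts(counts, lengths)
for name, value in tpm.items():
    print(name, round(value, 2))
```

With these toy numbers, `tx_a` gets twice the TPM of `tx_b`, because the same number of reads is spread over half the length, and the values sum to 10^6.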


Yes, you're right. I read that in this paper too and forgot about it. What I wrote is an estimate of NPM, which can easily be converted to TPM; alternatively, TPM can be calculated directly from your formula. I am also aware of the benefits of TPM and the drawbacks of RPKM, but as you said, it should not make much difference for my single sample, especially because I am not doing differential expression analysis. Thanks!
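For the NPM-to-TPM conversion mentioned here, a rough Python sketch could look like the following. It assumes NPM is defined as in the question (reads from a transcript divided by total reads, times 10^6) and that converting means dividing by transcript length and renormalizing. The function names and the toy data are hypothetical.

```python
def npm_from_counts(counts):
    """NPM as in the question: fraction of all reads, scaled to one million."""
    total_reads = sum(counts.values())
    return {t: 1e6 * n / total_reads for t, n in counts.items()}


def tpm_from_npm(npm, lengths):
    """Convert NPM to TPM by dividing by length and renormalizing to one million."""
    per_base = {t: npm[t] / lengths[t] for t in npm}
    total = sum(per_base.values())
    return {t: 1e6 * value / total for t, value in per_base.items()}


# Hypothetical toy data: tx_a is three times longer and yields three times the reads
sim_counts = {"tx_a": 300, "tx_b": 100}
sim_lengths = {"tx_a": 3000, "tx_b": 1000}

sim_npm = npm_from_counts(sim_counts)
sim_tpm = tpm_from_npm(sim_npm, sim_lengths)
print(sim_npm)
print(sim_tpm)
```

In this toy case the two transcripts have different NPM values but the same TPM, since both have the same number of reads per base.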