I am trying to test a piece of software for abundance estimation, and I would like to generate my own set of reads for which the expected relative abundances are known in advance, so that I can compare the tool's output against that benchmark. If I have a set of N transcripts and I generate M reads from them, knowing the origin of each read, can I use that information to compute the expected RPKM or TPM, and if so, how? Would the TPM for a specific transcript just be num_reads_from_it / num_reads_overall * 10^6?
If you know the number of reads originating from each transcript t (call it n_t), then you can compute TPM_t = 10^6 * [(n_t / l_t) / sum_t' (n_t' / l_t')], where l_t is the length of transcript t and the sum runs over all transcripts t'. Note that this is different from the formula you have above. That one computes NPM (nucleotides per million), which is a measure of relative abundance that is *not* normalized for length. Also, I'd avoid FPKM / RPKM completely: there's no benefit relative to TPM, but there are some shortcomings (though it shouldn't really matter when assessing accuracy on simulated data in a single sample).
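To make the difference between the two formulas concrete, here is a minimal Python sketch. The function names, transcript IDs, counts, and lengths are all made up for illustration, and it uses the plain transcript length for l_t.

```python
def tpm_from_counts(read_counts, lengths):
    """TPM_t = 10^6 * (n_t / l_t) / sum_t' (n_t' / l_t')."""
    # Length-normalized read rate for each transcript: n_t / l_t
    rates = {t: read_counts[t] / lengths[t] for t in read_counts}
    total_rate = sum(rates.values())
    return {t: 1e6 * rate / total_rate for t, rate in rates.items()}


def read_fraction_per_million(read_counts):
    """The formula from the question: n_t / sum_t' n_t' * 10^6 (no length term)."""
    total_reads = sum(read_counts.values())
    return {t: 1e6 * n / total_reads for t, n in read_counts.items()}


# Hypothetical simulation: the true origin of every read is known.
counts = {"tx_A": 100, "tx_B": 100, "tx_C": 300}      # n_t
lengths = {"tx_A": 1000, "tx_B": 2000, "tx_C": 3000}  # l_t, in nucleotides

tpm = tpm_from_counts(counts, lengths)
frac = read_fraction_per_million(counts)

for t in counts:
    print(f"{t}\tTPM={tpm[t]:.1f}\tread fraction per million={frac[t]:.1f}")
```

With these made-up numbers, tx_A and tx_B contribute the same number of reads, so the un-normalized formula gives them the same value, but tx_B is twice as long, so its TPM comes out at half that of tx_A. Both sets of values sum to 10^6.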