Does TPM account for fragments?
3
4
Entering edit mode
5.5 years ago
rrcutler ▴ 120

Hello all,

I understand the differences between RPKM, which normalizes for read depth and then gene length; FPKM, the same thing as RPKM but does not count reads twice if both came from the same fragment; and TPM which normalizes for gene length, then read length - allowing comparison of read count proportion between samples.

However, I have not come across any literature on whether the same principle of not counting reads twice from the same fragment (FPKM) has been applied to the TPM method. I don't doubt that this is done, I would just like to know how.

Thanks

RNA-Seq • 3.7k views
5
Entering edit mode
5.5 years ago
Rob 5.1k

Yes, the TPM measure accounts for fragments. In a single-end library, each fragment is represented by a sequencing read, while in a paired-end library, each fragment is represented by a read pair. If r_i is the number of fragments mapping to transcript t_i then we can define

TPM_i = 1,000,000 * [(r_i / l_i) / (Σ_j (r_j / l_j))]

where l_i is the effective length of transcript i. The denominator normalizes for the length scaled abundance of all transcripts in this sample. In a single-end library r_i represents the number of reads mapping to transcript t_i, while in a paired-end library, it represents the number of read pairs (ignoring, for simplicity, that an orphaned read is representative of the entire pair for this purpose).

0
Entering edit mode

Is it not the case that because the number of reads in a paired end library is twice the number of fragments, and because r_i appears in both the numerator and the denominator, the calculated TPM will be the same whether you use fragments or reads?

1
Entering edit mode

0
Entering edit mode

Hey Rob thanks for the detailed response to this question. Still I think I might be missing out on really understanding the average fragment length and it's relation to effective transcript length.

So if you had the time I was wondering if you could answer a question about the importance of including the 'estimated fragment length' in the calculation. Under the assumption that you are only comparing samples within an experiment (and that your goal was to have an average fragment length of size X for all samples). Is including that value really essential..? (Is it perhaps that fragment length varies rather significantly from sample to sample and therefore is included)

Also for the effective length of a transcript should you really be subtracting the read length? (As seen in this script: https://www.biostars.org/p/171766/#171837) I could understand if you count 'reads overlapping a transcript' in an 'intersection_strict' (http://www-huber.embl.de/users/anders/HTSeq/doc/_images/count_modes.png) manner [in this case its very clear that the effective transcript length is more accurate] but if for instance you count overlaps in a 'union' manner wouldn't this change the underlying assumption (that at the edge of the transcript annotation you could have a read with a single nt overlap).

I noticed you didn't mention it in your previous post concerning TPM: A: Calculating TPM from featureCounts output

1
Entering edit mode

Hi,

Let me try to answer your question. First, the relation of the fragment length distribution (and thus the average fragment length) to the effective transcript length.

The motivation for effective length correction is that not all positions on a transcript are equally "sequenceable". Forget, for a moment, about the technicalities of union, intersection, etc. (since these notions don't exist in the context of transcript quantification). Instead, think about drawing sequences from an actual transcript. Imagine I have a transcript of length 1000. Starting from, say, the 100th position, I could draw a fragment of length 100, 200, 300, etc. all the way up through 900. This is because if I start the fragment at position 100, there are 900 sequenceable bases downstream of the fragment start site. Now, 900 is likely larger than almost all of the actual fragments in my library. In other words, the length of the fragment I could draw starting at this position is effectively unconstrained. If we were to add up the probabilities, under our empirical fragment length distribution, corresponding to all sequenceable fragment lengths that we could draw starting at position 100, this sum would be very close to 1. Now, imagine instead that a sequenced fragment from the same transcript starts at position 950. What fragment lengths could be sequenced? Well, any fragment of length greater than 50 is not possible, since, starting at position 950, there are only 50 sequenceable bases remaining in the transcript. Thus, if we were to add up the probabilities, under our empirical fragment length distribution, corresponding to all sequenceable fragment lengths that we could draw starting at position 950, this sum would be very small (the empirical probability of seeing a fragment of length 1 or 2 or 3 or ... or 50). Thus, given the fragment length distribution, we know that not all positions on a transcript are equally "sequenceable", and this is exactly what the effective length accounts for. The effective length of the transcript is simply the sum of the sequenceable probabilities, summed over every possible starting position in the transcript. This will be small near the ends of the transcripts, and essentially 1 away from the ends. This value (the effective length) is usually very well approximated by taking the transcript's original length minus the average fragment length (as determined by the empirical fragment length distribution).

So regarding you second question, the script in the post you reference seems to subtract the average fragment length from the feature length, which agrees with the approximation I mention above. However, one potential issue with that script, and certainly something that may add to confusion, is that manner of effective length correction doesn't really make sense in the "gene counting" model. This is because the effective length is a property of the transcript (i.e. the target from which the fragments are actually being drawn). When you do gene-level analysis by simply counting the reads that hit a gene model, you really don't know the length of the transcript that is generating each fragment. Crucially, for the purposes of computing a relative abundance metric like TPM, subtracting the mean fragment length from the "gene model" is not equivalent to subtracting the mean fragment length from each of the constituent transcripts of that gene that might generate fragments at different rates. This is why tools like RSEM and tximport derive an average gene length by looking at an abundance weighted average of transcript lengths. In order for the effective length correction to make sense, you have to account for the feature from which you're actually sampling the fragments, and the union, intersection, or union-intersection gene models do not constitute such features. Let me know if any of this clarifies what I've written above.

0
Entering edit mode
5.5 years ago

My understanding is that TPM_i = 1,000,000*FPKM_i/sum(FPKM) the fragment counting should come out in the wash.

That is, naively, RPKM = 2* FPKM, so

TPM = 1e6 * RPKM_i/sum(RPKM)
= 1e6 * 2 * FPKM_i/sum(2*FPKM)
= 2e6 * FPKM_i/2*sum(FPKM)
= 1e6 * FPKM_i/sum(FPKM)


so it doesn't matter if you use FPKM or RPKM when calculating TPM.

0
Entering edit mode
5.5 years ago
igor 12k

TPM is "Transcripts Per Million". FPKM is "Fragments Per Kilobase of transcript per Million mapped reads". TPM does not account for gene length.

One transcript would be one fragment, which is one read for single-end or a read pair for paired-end sequencing.

1
Entering edit mode

Just a comment but I'm pretty sure TPM does account for gene length.

0
Entering edit mode

It looks like there are two different TPMs (transcripts per million and transcripts per kilobase million) that are used interchangeably for some reason.

For example, popular and relatively old package RSEM specifically does not consider transcript length for TPM:

These (possibly rounded) counts may be used by a differential expression method such as edgeR [9] or DESeq [8]. The second measure of abundance is the estimated fraction of transcripts made up by a given isoform or gene. This measure can be used directly as a value between zero and one or can be multiplied by 106 to obtain a measure in terms of transcripts per million (TPM). The transcript fraction measure is preferred over the popular RPKM [18] and FPKM [6] measures because it is independent of the mean expressed transcript length and is thus more comparable across samples and species [7].

Also edgeR: https://support.bioconductor.org/p/33996/

1
Entering edit mode

I think there may be some confusion due to the wording, but RSEM does normalize for gene length when computing TPM. The following follows the notation in the original RSEM paper. Let τ_i (Equation (1)) be the fraction of transcripts in the sample equivalent to isoform i. Here, τ_i is a length-normalized measure of the abundance of this isoform. It is length-normalized because we consider the τ_i fraction to be proportional to the number of copies of transcript i in the sample (implicitly accounting for the fact that the number of reads that would be generated from longer transcripts is a priori larger than the number of reads that would be generated by shorter transcripts, even if the same number of copies of each is present in the sample). From this transcript fraction τ_i, they define the nucleotide fraction 𝝼_i = (τ_i * l_i) / (Σ_j (τ_j * l_j)) (Equation (2)). This measure 𝝼_i is called the nucleotide fraction, and is directly proportional to the number of reads we expect to sample from each transcript. Take note of the multiplication by l_i in the computation of 𝝼_i; this is because we expect, by chance, to sample more reads from longer transcripts. Thus 𝝼_i, the nucleotide fraction of transcript i, is not length-normalized. The transcripts per million (TPM) metric is obtained by multiplying τ_i by 10^6 (right column of page 2). Thus, it is directly proportional to the transcript fraction, and accounts for the transcript length (i.e. is length-normalized).

0
Entering edit mode

But then why do they specifically say that it's independent of transcript length unlike RPKM/FPKM?

1
Entering edit mode

ahh.. I see the confusion. TPM is not independent of transcript length, it is independent of the mean expressed transcript length. That is RPKM / FPKM contain an arbitrary scaling factor (that can change between samples) that depends on the average expressed transcript length in the sample. The TPM does not contain this arbitrary scaling factor, though it is still a relative measure of abundance.

1
Entering edit mode

I thought I saw someone mention that TPM does not take gene length into account, but maybe they were wrong also. There really should be a K in there to avoid any confusion.