2.4 years ago by

United States

Warning: I don't do RNA-seq often; my comments below may be inaccurate.

I was looking at an RNA-seq data set where only FPKM is provided. As I need raw read counts for edgeR-like analyses, I did some research on how FPKM and the related TPM are calculated. I also consulted Rob Patro for help. In the end, it seems to me that there are multiple subtly different ways to compute FPKM and TPM, so values computed by different tools are often not comparable.

I think the most precise description of FPKM/etc is here. Importantly, to derive FPKM/etc from raw read counts, we need to compute the effective transcript length (the \tilde{l} in the link above). The exact approach to computing this value is tool-dependent. Rob commented that:

A different approach [to computing effective length] (which is used in Salmon and kallisto) is to define the effective length of a transcript as L - \mu_{L}, where \mu_{L} is the mean of the fragment length distribution for all fragments of length <= L.

and mentioned that "the effective length can also be modified to account for sampling biases". There is not a single way to compute effective length and thus not a single way to compute FPKM/TPM.
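To make Rob's definition concrete, here is a minimal sketch of L - \mu_{L}. It is only my reading of the quoted sentence, not code taken from Salmon or kallisto. The function name, the dict-based fragment length histogram, and the fallback used when no observed fragment fits in the transcript are all my own assumptions.

```python
def effective_length(transcript_len, frag_len_counts):
    """Effective length as L - mu_L, following the formula quoted above.

    frag_len_counts is a hypothetical input: a dict mapping fragment
    length -> number of fragments observed with that length.
    mu_L is the mean fragment length over fragments of length <= L.
    """
    n_fragments = 0
    length_sum = 0
    for frag_len, count in frag_len_counts.items():
        if frag_len <= transcript_len:
            n_fragments += count
            length_sum += frag_len * count
    if n_fragments == 0:
        # Assumption: if no observed fragment fits, fall back to the raw
        # length. Real tools may handle this case differently.
        return float(transcript_len)
    mu = length_sum / n_fragments
    return transcript_len - mu


if __name__ == "__main__":
    # Toy histogram, made up for illustration only.
    hist = {100: 1, 200: 2, 300: 1}
    for length in (250, 1000):
        print(length, effective_length(length, hist))
```

Note that for a short transcript the mean is taken over fewer fragment lengths, so the correction is smaller than for a long one. Bias-corrected effective lengths, as Rob mentioned, are not covered by this sketch.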

As a side note, GTEx provided both raw counts and FPKM, and I tried to convert from counts to FPKM. However, it seems that GTEx is using an effective length longer than the transcript length, which would be impossible with Rob's formula or the formula in the link above...

In all, your question is not only about how to compute FPKM/TPM, but is also related to which flavor of FPKM/TPM to compute. If I were given such a task, I would take Rob's formula to compute effective length and the TPM formula in the linked webpage. Note that you need to know the insert size/fragment length distribution of your library in order to compute TPM accurately.
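For what it is worth, once you have settled on an effective length, the conversion itself is short. Below is a rough sketch using the usual textbook definitions: TPM normalizes the per-base rate count/length so that it sums to one million, and FPKM divides the count by length in kilobases and by total fragments in millions. The function names are mine, and I am assuming the total fragment count N is simply the sum of the counts passed in. Tools differ on both N and the length used, which is exactly why the numbers are not comparable.

```python
def tpm(counts, eff_lens):
    """TPM_i = (c_i / l_i) / sum_j (c_j / l_j) * 1e6."""
    rates = [c / l for c, l in zip(counts, eff_lens)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


def fpkm(counts, eff_lens):
    """FPKM_i = c_i * 1e9 / (l_i * N).

    Assumption: N is the sum of the supplied counts. A given tool may
    use a different total (e.g. all mapped fragments).
    """
    n_total = sum(counts)
    return [c * 1e9 / (l * n_total) for c, l in zip(counts, eff_lens)]


if __name__ == "__main__":
    # Toy numbers, made up for illustration only.
    counts = [10, 20, 30]
    eff_lens = [1000.0, 2000.0, 3000.0]
    print(tpm(counts, eff_lens))
    print(fpkm(counts, eff_lens))
```

Under these definitions TPM is just FPKM rescaled to sum to one million, so going from FPKM to TPM needs no extra information. Going back from FPKM to raw counts does: you need both the effective lengths and the total N that the original tool used.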

written
2.4 years ago by
lh3 ♦ **31k**