Question: Calculating FPKM from HTSeq counts table
gravatar for ann081993
9 months ago by
ann0819930 wrote:


I'm new-bie to RNA-Seq and I'm trying to set up RNA-Seq analysis pipeline. Now, I've got stuck on calculating FPKM from HTSeq counts table.

I've read some articles about FPKM like this and questions on Biostars. But, I' not sure that did I correctly understand because can not reproduce FPKM with public data.

I tried calculation in R and used data from GDC data portal, 551042d1-2551-46bb-a56c-3cc79ed09469.htseq.counts and 551042d1-2551-46bb-a56c-3cc79ed09469.FPKM.txt as an example. (you can get it here)

count <- read.table("551042d1-2551-46bb-a56c-3cc79ed09469.htseq.counts",
                stringsAsFactors = FALSE,
                col.names = c("id", "counts"))
count.fpkm <- read.table("551042d1-2551-46bb-a56c-3cc79ed09469.FPKM.txt",
                     stringsAsFactors = FALSE,
                     col.names = c("id", "fpkm"))

Gene lengths are extracted with scripts from this article.

mrg <- merge(count, exonic.gene.sizes, by = "id") 
                  id counts   size
1 ENSG00000000003.13   1407  12883
2  ENSG00000000005.5      1  15084
3 ENSG00000000419.11   2280  23689
4 ENSG00000000457.12   1525  44637
5 ENSG00000000460.15    992 192074
6 ENSG00000000938.11   1237  23214

N <- sum(mrg$counts)
mrg.fpkm <- transform(mrg, fpkm = counts / size / N * 10^9)

                  id counts   size        fpkm
1 ENSG00000000003.13   1407  12883 1.522140655
2  ENSG00000000005.5      1  15084 0.000923977
3 ENSG00000000419.11   2280  23689 1.341423203
4 ENSG00000000457.12   1525  44637 0.476159595
5 ENSG00000000460.15    992 192074 0.071981482
6 ENSG00000000938.11   1237  23214 0.742672623

plot(x = mrg.fpkm$fpkm, y = count.fpkm$fpkm)

The result is totally different from what I expected. What's wrong with my calculation or what's wrong in my understanding?

rna-seq fpkm R htseq • 1.5k views
ADD COMMENTlink modified 9 months ago • written 9 months ago by ann0819930

FPKM = (1e+06*Readcount_Gene) / (Gene_Length_in_kb * total_read_depth)

So basically the read count per gene, divided by its length and the total read count. This results in a tiny number. Hence, one multiplies with 1mio to make the number more intuitive to use. Still, FPKM is no longer considered a good choice for RNA-seq. Rather than that TPM seems to be more reasonable (google for Lior Pachters blog together with "TPM" for some details on that).

Some suggestions: First, use a published package for RNA-seq analysis, such as edgeR, DESeq, kallisto-sleuth, etc. These tools will do the calculation for you, including proper statistics and FDR corrections. Second, when plotting data of this kind, use log2-transformed values to avoid excessive axis limits, e.g. log2(readcount+1), +1 as log2 of 0 is undefined.

ADD REPLYlink written 9 months ago by ATpoint4.4k

Thanks for your suggestions, and you were right. I found the relationship between fpkm value from data portal and the I calculated. So from now, based on what I understood, I'm going to learn about TPM and other pre-built packages, as you suggested. Thanks.enter image description here

ADD REPLYlink modified 9 months ago • written 9 months ago by ann0819930

Video 1: (seriously consider spending 45 minutes to watch this, it's extremely educative)

Reading 1:

Reading 2:

Reading 3:

ADD REPLYlink written 9 months ago by Macspider2.4k

Thank you for information.!

ADD REPLYlink written 9 months ago by ann0819930
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2098 users visited in the last hour