Question: Calculating FPKM from HTSeq counts table
gravatar for ann081993
2.5 years ago by
ann0819930 wrote:


I'm new-bie to RNA-Seq and I'm trying to set up RNA-Seq analysis pipeline. Now, I've got stuck on calculating FPKM from HTSeq counts table.

I've read some articles about FPKM like this and questions on Biostars. But, I' not sure that did I correctly understand because can not reproduce FPKM with public data.

I tried calculation in R and used data from GDC data portal, 551042d1-2551-46bb-a56c-3cc79ed09469.htseq.counts and 551042d1-2551-46bb-a56c-3cc79ed09469.FPKM.txt as an example. (you can get it here)

count <- read.table("551042d1-2551-46bb-a56c-3cc79ed09469.htseq.counts",
                stringsAsFactors = FALSE,
                col.names = c("id", "counts"))
count.fpkm <- read.table("551042d1-2551-46bb-a56c-3cc79ed09469.FPKM.txt",
                     stringsAsFactors = FALSE,
                     col.names = c("id", "fpkm"))

Gene lengths are extracted with scripts from this article.

mrg <- merge(count, exonic.gene.sizes, by = "id") 
                  id counts   size
1 ENSG00000000003.13   1407  12883
2  ENSG00000000005.5      1  15084
3 ENSG00000000419.11   2280  23689
4 ENSG00000000457.12   1525  44637
5 ENSG00000000460.15    992 192074
6 ENSG00000000938.11   1237  23214

N <- sum(mrg$counts)
mrg.fpkm <- transform(mrg, fpkm = counts / size / N * 10^9)

                  id counts   size        fpkm
1 ENSG00000000003.13   1407  12883 1.522140655
2  ENSG00000000005.5      1  15084 0.000923977
3 ENSG00000000419.11   2280  23689 1.341423203
4 ENSG00000000457.12   1525  44637 0.476159595
5 ENSG00000000460.15    992 192074 0.071981482
6 ENSG00000000938.11   1237  23214 0.742672623

plot(x = mrg.fpkm$fpkm, y = count.fpkm$fpkm)

The result is totally different from what I expected. What's wrong with my calculation or what's wrong in my understanding?

rna-seq fpkm R htseq • 4.2k views
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by ann0819930

FPKM = (1e+06*Readcount_Gene) / (Gene_Length_in_kb * total_read_depth)

So basically the read count per gene, divided by its length and the total read count. This results in a tiny number. Hence, one multiplies with 1mio to make the number more intuitive to use. Still, FPKM is no longer considered a good choice for RNA-seq. Rather than that TPM seems to be more reasonable (google for Lior Pachters blog together with "TPM" for some details on that).

Some suggestions: First, use a published package for RNA-seq analysis, such as edgeR, DESeq, kallisto-sleuth, etc. These tools will do the calculation for you, including proper statistics and FDR corrections. Second, when plotting data of this kind, use log2-transformed values to avoid excessive axis limits, e.g. log2(readcount+1), +1 as log2 of 0 is undefined.

ADD REPLYlink written 2.5 years ago by ATpoint30k

Thanks for your suggestions, and you were right. I found the relationship between fpkm value from data portal and the I calculated. So from now, based on what I understood, I'm going to learn about TPM and other pre-built packages, as you suggested. Thanks.enter image description here

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by ann0819930

Simply logging something may make it look nice but it does not imply that the distribution is reflective of the underlying 'biology' being measured. Please read:

An update (6th October 2018):

You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimental: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

ADD REPLYlink modified 16 months ago • written 18 months ago by Kevin Blighe55k

Video 1: (seriously consider spending 45 minutes to watch this, it's extremely educative)

Reading 1:

Reading 2:

Reading 3:

ADD REPLYlink written 2.5 years ago by Macspider3.0k

Thank you for information.!

ADD REPLYlink written 2.5 years ago by ann0819930
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1674 users visited in the last hour