Question: Calculating FPKM from HTSeq counts table
gravatar for ann081993
12 months ago by
ann0819930 wrote:


I'm new-bie to RNA-Seq and I'm trying to set up RNA-Seq analysis pipeline. Now, I've got stuck on calculating FPKM from HTSeq counts table.

I've read some articles about FPKM like this and questions on Biostars. But, I' not sure that did I correctly understand because can not reproduce FPKM with public data.

I tried calculation in R and used data from GDC data portal, 551042d1-2551-46bb-a56c-3cc79ed09469.htseq.counts and 551042d1-2551-46bb-a56c-3cc79ed09469.FPKM.txt as an example. (you can get it here)

count <- read.table("551042d1-2551-46bb-a56c-3cc79ed09469.htseq.counts",
                stringsAsFactors = FALSE,
                col.names = c("id", "counts"))
count.fpkm <- read.table("551042d1-2551-46bb-a56c-3cc79ed09469.FPKM.txt",
                     stringsAsFactors = FALSE,
                     col.names = c("id", "fpkm"))

Gene lengths are extracted with scripts from this article.

mrg <- merge(count, exonic.gene.sizes, by = "id") 
                  id counts   size
1 ENSG00000000003.13   1407  12883
2  ENSG00000000005.5      1  15084
3 ENSG00000000419.11   2280  23689
4 ENSG00000000457.12   1525  44637
5 ENSG00000000460.15    992 192074
6 ENSG00000000938.11   1237  23214

N <- sum(mrg$counts)
mrg.fpkm <- transform(mrg, fpkm = counts / size / N * 10^9)

                  id counts   size        fpkm
1 ENSG00000000003.13   1407  12883 1.522140655
2  ENSG00000000005.5      1  15084 0.000923977
3 ENSG00000000419.11   2280  23689 1.341423203
4 ENSG00000000457.12   1525  44637 0.476159595
5 ENSG00000000460.15    992 192074 0.071981482
6 ENSG00000000938.11   1237  23214 0.742672623

plot(x = mrg.fpkm$fpkm, y = count.fpkm$fpkm)

The result is totally different from what I expected. What's wrong with my calculation or what's wrong in my understanding?

rna-seq fpkm R htseq • 2.0k views
ADD COMMENTlink modified 12 months ago • written 12 months ago by ann0819930

FPKM = (1e+06*Readcount_Gene) / (Gene_Length_in_kb * total_read_depth)

So basically the read count per gene, divided by its length and the total read count. This results in a tiny number. Hence, one multiplies with 1mio to make the number more intuitive to use. Still, FPKM is no longer considered a good choice for RNA-seq. Rather than that TPM seems to be more reasonable (google for Lior Pachters blog together with "TPM" for some details on that).

Some suggestions: First, use a published package for RNA-seq analysis, such as edgeR, DESeq, kallisto-sleuth, etc. These tools will do the calculation for you, including proper statistics and FDR corrections. Second, when plotting data of this kind, use log2-transformed values to avoid excessive axis limits, e.g. log2(readcount+1), +1 as log2 of 0 is undefined.

ADD REPLYlink written 12 months ago by ATpoint7.5k

Thanks for your suggestions, and you were right. I found the relationship between fpkm value from data portal and the I calculated. So from now, based on what I understood, I'm going to learn about TPM and other pre-built packages, as you suggested. Thanks.enter image description here

ADD REPLYlink modified 12 months ago • written 12 months ago by ann0819930

Simply logging something may make it look nice but it does not imply that the distribution is reflective of the underlying 'biology' being measured. Please read:

An update (12th August 2018):

You should abandon RPKM / FPKM normalisation. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis: Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

In their key points:

The Total Count and RPKM normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Note - FPKM is essentially the same as RPKM

ADD REPLYlink written 5 weeks ago by Kevin Blighe28k

Video 1: (seriously consider spending 45 minutes to watch this, it's extremely educative)

Reading 1:

Reading 2:

Reading 3:

ADD REPLYlink written 12 months ago by Macspider2.5k

Thank you for information.!

ADD REPLYlink written 12 months ago by ann0819930
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1494 users visited in the last hour