Perhaps this has already been answered somewhere, but I am not seeing a satisfactory explanation. I want to understand how one calculates FPKM (fragments per kilobase of exon per million fragments mapped) in RNA-seq data. Everywhere I look, I see people saying that it is the number of reads aligned per kilobase of transcript per million mapped reads in the whole dataset, and that the difference between RPKM and FPKM is that, for paired-end data, one fragment is a pair of reads. If I have any aspect of that wrong, please inform me.
If the above is right, then how is it that Cufflinks is able to find transcripts that are as low as 10^-12 FPKM? How is that possible?
So I have tried to do a back-of-the-envelope calculation on a gene that has a very low FPKM as reported by Cufflinks. This gene's total combined exons are ~3 kb. It has ~2000 reads aligned by TopHat, and the dataset has ~24 million reads in total. If I am understanding how to calculate it, it seems like the gene's FPKM should be 28, or at least somewhere near that order of magnitude. Instead the Cufflinks output says that it has an FPKM of 2.9531e-12. What am I missing here or doing wrong? How can any transcript have such a low FPKM/RPKM? If the dataset size is in the range of 10-100 million reads, then to get a number like 10^-12 with even just 1 read/fragment, you would need a transcript that is larger than the size of the human genome?
So I know I must not be understanding this right. Thank you in advance for your help!
You are totally right, there is no way of getting near that number using the RPKM formula: 2000 / (3 kb * 24 million reads) ~ 28, or equivalently 10^9 * 2000 / (3000 * 2.4e7). Is the FPKM formula maybe different? Is it documented how Cufflinks calculates this?
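For anyone who wants to check the arithmetic, here is a minimal Python sketch of the naive calculation discussed above. It is not how Cufflinks computes FPKM; it only applies the plain "count per kilobase per million" definition. The function names are made up for illustration, and it assumes the ~2000 aligned reads can be treated as fragments and that all ~24 million reads were mapped.

```python
def naive_fpkm(fragments, length_bp, total_mapped_fragments):
    """Plain definition: fragments per kilobase of exon per million mapped fragments.

    Hypothetical helper for illustration; Cufflinks uses a statistical model instead.
    """
    length_kb = length_bp / 1e3
    millions_mapped = total_mapped_fragments / 1e6
    return fragments / (length_kb * millions_mapped)


def length_bp_for_fpkm(target_fpkm, fragments, total_mapped_fragments):
    """Transcript length (bp) at which the plain formula would yield target_fpkm."""
    millions_mapped = total_mapped_fragments / 1e6
    length_kb = fragments / (target_fpkm * millions_mapped)
    return length_kb * 1e3


# Numbers from the question: ~2000 reads, ~3 kb of exons, ~24 million reads.
naive = naive_fpkm(2000, 3000, 24e6)

# How long would a transcript have to be for a single fragment to give 1e-12?
needed_bp = length_bp_for_fpkm(1e-12, 1, 24e6)

print(f"naive FPKM: {naive:.1f}")
print(f"length needed for 1e-12 FPKM from one fragment: {needed_bp:.2e} bp")
```

The first number comes out near 28, matching the estimate in the question. The second is on the order of 10^13 bp, thousands of times larger than the human genome (roughly 3 x 10^9 bp), so a value like 2.9531e-12 cannot come from simple counting.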
Paired-end based "fragments per kilobase of exon per million fragments mapped" (FPKM) is analogous to single-end based "reads per kilobase of exon per million mapped reads" (RPKM) and is "simply a nomenclature change to better reflect what RNA-Seq actually measures".
Cufflinks uses a statistical model to calculate FPKM rather than a simple count. It's given in the supplementary methods of the Cufflinks paper. When running Cufflinks on single-end reads you even supply the mean and standard deviation of the fragment length distribution, and the results vary with different parameters.