I use TopHat and Cufflinks to process some NGS data (human genome). When using cuffdiff to obtain values for the DE I got FPKM values for each group. Since I didn't provide any gtf-file with known exons I wondered how the FPKM values were calculated.
The formula for FPKM is 10^9 * C / (N * L), with C is the number of mappable reads that fell onto the gene's exons (how did the program know this?), N the total number of mappable reads in the experiment and L the number of base pairs in the exon (again, where is this number coming from?).
I used the pipeline provided in the Cufflinks tutorial: http://cufflinks.cbcb.umd.edu/tutorial.html (see below).
Then I read in this forum an answer which is quite related to my current question but I thinks this is opposed to the cufflinks tutorial. So I'm confused now :-)
In another try I used some single end data and also obtained FPKM values. Shouldn't I get a RPKM value for single end reads?
Thanks in advance, Oliver
Here my pipeline (I hope I pooled the samples the right way?):
SRR027863_1.fastq SRR027863_2.fastq SRR027864_1.fastq SRR027864_2.fastq SRR027865_1.fastq SRR027865_2.fastq
SRR027866_1.fastq SRR027866_2.fastq SRR027867_1.fastq SRR027867_2.fastq
tophat -r 160 -o top_SRR027863-65 ../../../reference/hg19 SRR027863_1.fastq,SRR027864_1.fastq,SRR027865_1.fastq SRR027863_2.fastq,SRR027864_2.fastq,SRR027865_2.fastq tophat -r 160 -o top_SRR027866-67 ../../../reference/hg19 SRR027866_1.fastq,SRR027867_1.fastq SRR027866_2.fastq,SRR027867_2.fastq cufflinks -o cuff_SRR027863-65 top_SRR027863-65/accepted_hits.bam cufflinks -o cuff_SRR027866-67 top_SRR027866-67/accepted_hits.bam cuffmerge -s ../../../reference/hg19.fa assemblies.txt cuffdiff merged_asm/merged.gtf top_SRR027863-65/accepted_hits.bam top_SRR027866-67/accepted_hits.bam
Tophat has aligned your reads to the reference genome, and marked those reads which align with and without splice junctions. It's intended for transcriptome analysis, or RNA-Seq. That way, anything that maps unspliced must be an exon and anything that jumps regions must span an intron.
Cufflinks then looks at the distribution of reads and estimates a transcript structure (or more than one at the same region).
It's got a fancy probabilistic assignment method and you should read their paper for details.
So each region(gene) has multiple transcripts and each transcript has multiple exons, but the transcripts in a region share exons and that's why the reads don't map precisely, but probabilistically. That's why instead of giving you read count numbers, it gives this fancy FPKM number.
Tophat might be the wrong tool if you arent doing RNA-Seq, but instead looking purely at genomic DNA. Try to get your data into a viewer like IGV to see if it looks right. IGV can load a SAM or BAM and show you where the reads are, it's nice for reviewing the aligner's work.
If you're after a simpler to interpret system or not interested in alternative splicing/transcripts. Try something like DESeq, a simple read count based system.
As to the question about FPKM vs RPKM, it's just a name, they switched from Reads per... to Fragments per... to clean up confusion regarding paired end reads, now its all about the number of cDNA fragments regardless of which way you do the measurement (SE or PE).