I want to use RNA-seq data to quantify gene expression, with no intention of finding new transcripts. It would save time to align the reads directly to a transcriptome index built from RefSeq RNA, instead of aligning them to the genome and then looking up annotations. I align the reads with bowtie and then quantify them with cufflinks (cufflinks has no way of differentiating transcriptome alignments from genome alignments). However, I am not sure whether cufflinks can calculate FPKM correctly from such alignments.
How would I incorporate gene length information? Would I have to make a GTF file for that?
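One way such a GTF file might be built, as a minimal sketch: each transcript in the index is treated as its own reference sequence and given a single exon spanning its full length. The transcript IDs and the `source` label below are hypothetical, using the transcript ID as the `gene_id` is an assumption (a real transcript-to-gene mapping would be needed for gene-level values), and whether cufflinks handles such a file as intended is not something this sketch verifies.

```python
def transcript_gtf_lines(lengths, source="refseq_rna"):
    """Yield one GTF exon line per transcript, covering positions 1..length.

    lengths: dict mapping transcript ID -> transcript length in bases.
    source:  free-text label for GTF column 2 (hypothetical value).
    """
    for tid, length in lengths.items():
        # Assumption: gene_id is set to the transcript ID because no
        # transcript-to-gene mapping is available in this sketch.
        attrs = f'gene_id "{tid}"; transcript_id "{tid}";'
        # GTF columns: seqname, source, feature, start, end,
        # score, strand, frame, attributes (tab separated).
        yield "\t".join(
            [tid, source, "exon", "1", str(length), ".", "+", ".", attrs]
        )


# Hypothetical transcript IDs and lengths, for illustration only.
example_lengths = {"tx_A": 1500, "tx_B": 800}
print("\n".join(transcript_gtf_lines(example_lengths)))
```

The lengths themselves could be read from the same FASTA file that was used to build the transcriptome index.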
I did read the methodology of cufflinks given in the supplementary information of the paper, but I am not exactly sure how it approximates the values. Moreover, the statistical model it uses treats the transcriptome as a subset of the genome, and the equations are written accordingly. I think this particular question must have been asked numerous times, but how exactly does cufflinks calculate FPKM?
Would it be better, in this case, to write my own script to calculate FPKM (counting one read pair as one fragment)?
Should fragment lengths be normalized?
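A minimal sketch of what such a script could compute, assuming per-transcript fragment counts (one read pair counted once) have already been extracted from the alignments. It uses the plain definition, FPKM = fragments * 10^9 / (transcript length * total mapped fragments), which is not the likelihood-based estimate that cufflinks produces. The function name, the transcript IDs and the optional `mean_fragment_length` adjustment (an "effective length" of length - mean fragment length + 1) are assumptions made for illustration.

```python
def fpkm(counts, lengths, mean_fragment_length=None):
    """Compute FPKM per transcript from raw fragment counts.

    counts:  dict mapping transcript ID -> number of mapped fragments.
    lengths: dict mapping transcript ID -> transcript length in bases.
    mean_fragment_length: if given, use an effective length of
        length - mean_fragment_length + 1 (floored at 1) instead of
        the full transcript length. This is an optional assumption.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped fragments")

    result = {}
    for tid, n in counts.items():
        length = lengths[tid]
        if mean_fragment_length is not None:
            # Fewer start positions can produce a full fragment near
            # the transcript end, so shorten the usable length.
            length = max(length - mean_fragment_length + 1, 1)
        # 1e9 = 1e3 (per kilobase) * 1e6 (per million fragments).
        result[tid] = n * 1e9 / (length * total)
    return result


# Hypothetical counts and lengths, for illustration only.
example_counts = {"tx_A": 100, "tx_B": 300}
example_lengths = {"tx_A": 1000, "tx_B": 2000}
print(fpkm(example_counts, example_lengths))
print(fpkm(example_counts, example_lengths, mean_fragment_length=200))
```

This treats every fragment as belonging to exactly one transcript, so fragments that map to several isoforms of the same gene would need to be resolved before counting.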