I have (poly-A) mRNA-seq data and want RPKM/FPKM values for each gene. I plan to give Cufflinks a GTF file containing only exon annotations (or should the transcript information also be included?) so that reads falling in intronic/intergenic regions are discarded. What risks could I run into with this approach?
First, if I provide only exon information, Cufflinks will know which reads to count. But as far as I understand, RPKM is reported per isoform. So how will it assign an RPKM value per isoform if I give it only annotated exons? Will it assemble the annotated exons based on the reads?
Second, if I provide only transcript information, it contains only the start and end of each transcript (no information about the start and end of the exons), so Cufflinks will also count intronic regions if they exist, and I want to avoid that.
Third, if I get RPKM per isoform, would it be appropriate to take the average over all isoforms and report it as RPKM per gene?
You are right, there are pros and cons to exon vs. transcript information, which is why you can give BOTH levels to Cufflinks! Look at this GTF file example with three levels: gene, transcript and exon.
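To illustrate what a three-level GTF looks like, here is a minimal sketch in Python. The records (`G1`, `T1`, the `demo` source, all coordinates) are made up for illustration and do not come from any real annotation. The point is that each exon line carries both a `gene_id` and a `transcript_id`, which is what ties an exon to an isoform, and that the exon lines (not the transcript line) define which bases are exonic.

```python
import re
from collections import defaultdict

# Hypothetical GTF records (made up for illustration).
# GTF coordinates are 1-based and inclusive.
ROWS = [
    ("chr1", "demo", "gene", 1000, 5000, 'gene_id "G1";'),
    ("chr1", "demo", "transcript", 1000, 5000, 'gene_id "G1"; transcript_id "T1";'),
    ("chr1", "demo", "exon", 1000, 1500, 'gene_id "G1"; transcript_id "T1"; exon_number "1";'),
    ("chr1", "demo", "exon", 4000, 5000, 'gene_id "G1"; transcript_id "T1"; exon_number "2";'),
]

# Build the 9-column, tab-separated text a GTF file would contain.
GTF_TEXT = "\n".join(
    "\t".join([chrom, src, feat, str(start), str(end), ".", "+", ".", attrs])
    for chrom, src, feat, start, end, attrs in ROWS
)

# Matches attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf(text):
    """Yield (feature, start, end, attributes) for each non-comment GTF line."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        yield cols[2], int(cols[3]), int(cols[4]), dict(ATTR_RE.findall(cols[8]))


exonic_length = defaultdict(int)  # transcript_id -> summed exon length
transcript_span = {}              # transcript_id -> genomic span including introns

for feature, start, end, attrs in parse_gtf(GTF_TEXT):
    if feature == "exon":
        exonic_length[attrs["transcript_id"]] += end - start + 1
    elif feature == "transcript":
        transcript_span[attrs["transcript_id"]] = end - start + 1

for tid, span in transcript_span.items():
    exonic = exonic_length[tid]
    print(tid, "span:", span, "exonic:", exonic, "intronic:", span - exonic)
```

In this toy record the transcript spans 4001 bp but only 1502 bp are exonic, so the exon lines are what let a tool exclude the intronic part.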
Edit: Concerning your third point, I don't know. I guess it depends on your question and data.
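For what it is worth, a small sketch of why the choice matters. The fragment counts, isoform lengths and library size below are invented, and `fpkm` is just a helper written for this example using the usual definition (fragments per kilobase of transcript per million mapped fragments). As far as I know, Cufflinks also writes gene-level values to `genes.fpkm_tracking`, computed as the sum over a gene's isoforms rather than the average, but please check that against the manual for your version.

```python
def fpkm(fragments, length_bp, total_mapped):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped)


# Hypothetical isoforms of one gene: (assigned fragments, exonic length in bp).
isoforms = {"T1": (100, 1000), "T2": (50, 2000)}
total_mapped = 1_000_000  # made-up library size

per_isoform = {
    tid: fpkm(frags, length, total_mapped)
    for tid, (frags, length) in isoforms.items()
}

gene_sum = sum(per_isoform.values())
gene_mean = gene_sum / len(per_isoform)

print("per isoform:", per_isoform)
print("sum over isoforms:", gene_sum)
print("mean over isoforms:", gene_mean)
```

With these numbers the isoforms come out at 100 and 25, so the sum is 125 and the mean is 62.5. The mean also shrinks whenever the annotation lists more lowly expressed isoforms for the gene, which is one reason to think twice before averaging.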