Hi, I'm having a bit of trouble understanding precisely how RPKM is calculated for TCGA RNAseq and RNAseqV2 data, specifically for 'genes' and 'exons'. First of all, what do they mean by:
a composite gene model was generated by merging all overlapping exons (as defined by the genomic mapping) from each associated reference transcript. Thus, each composite gene model is essentially the union of all associated reference transcripts.
Do they mean they simply counted any read that aligns to any transcript of the gene? Do they mean they counted only reads over the overlapping exons and discarded the rest? Or did they count only the reads aligning to some funky model obtained by aligning the transcripts and trimming them?
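To make the first interpretation concrete, here is a minimal sketch of how I read "merging all overlapping exons": take the exons of every reference transcript and collapse overlapping intervals into one set, so any read landing in any exon of any transcript would count. The coordinates, the 1-based inclusive convention, and the function names are my own assumptions for illustration; this is not taken from the TCGA pipeline.

```python
def merge_exons(exons):
    """Collapse overlapping (start, end) intervals into their union.

    Assumes 1-based inclusive coordinates on a single chromosome/strand.
    """
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def composite_length(exons):
    """Total number of bases covered by the merged exon set."""
    return sum(end - start + 1 for start, end in merge_exons(exons))


# Two hypothetical transcripts of the same gene (made-up coordinates).
transcript_a = [(100, 200), (300, 400)]
transcript_b = [(150, 250), (300, 350), (500, 600)]

composite = merge_exons(transcript_a + transcript_b)
print(composite)                                      # [(100, 250), (300, 400), (500, 600)]
print(composite_length(transcript_a + transcript_b))  # 353
```

If that reading is right, the "length" of the gene would be 353 bp here, which is longer than either transcript on its own (202 bp and 253 bp).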
Also, there seems to be some discrepancy in how RPKM is calculated for the 'gene'. Here they simply use gene length, and I'm not sure whether that means the mRNA length or something else:
RPKM for a given GeneX is calculated by: (raw read counts × 10^9) / (total reads × length of GeneX).
Here they calculate RPKM from the sum of the exon lengths:
RPKM is calculated using the formula: (number of reads mapped to all exons in a gene x 1,000,000,000)/(NORM_TOTAL x sum of the lengths of all exons in the gene ) [Note: NORM_TOTAL = the total number of reads that are mapped to all exons from the composite gene models. (i.e. sum of the fractional read count for all exons)]
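To see where the two formulas could actually diverge, here is a small sketch that evaluates both on made-up numbers. All values and names are hypothetical. The arithmetic is the same in both cases; the only things that can differ are which length goes in the denominator and whether the normalizer is all mapped reads or only reads mapped to exons (NORM_TOTAL).

```python
def rpkm(read_count, total_reads, length_bp):
    """Reads per kilobase per million: (reads x 10^9) / (total x length)."""
    return (read_count * 1e9) / (total_reads * length_bp)


# Hypothetical gene with two composite exons (made-up numbers).
exon_reads = [300, 200]       # reads (possibly fractional) per exon
exon_lengths = [1200, 800]    # exon lengths in bp

gene_reads = sum(exon_reads)      # 500
gene_length = sum(exon_lengths)   # 2000

# First formula: normalize by all mapped reads (assumed 20 million).
total_reads = 20_000_000
print(rpkm(gene_reads, total_reads, gene_length))   # 12.5

# Second formula: normalize by NORM_TOTAL, i.e. only reads mapped to
# exons of the composite gene models (assumed 16 million here).
norm_total = 16_000_000
print(rpkm(gene_reads, norm_total, gene_length))    # 15.625
```

So if "length of GeneX" means the summed exon length and "total reads" means NORM_TOTAL, the two formulas give identical values; if either one means something else, the results differ by a constant factor per sample (for the normalizer) or per gene (for the length).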
Also, whatever the actual method turns out to be, would it be the same for RNAseq and RNAseqV2 data? Here are the links I'm looking at.