Hi all, this is my first question.
I am using the Long RNA-seq from ENCODE/Cold Spring Harbor Lab gene expression data and the normalized average RPKM formula provided here:
score = normalized average RPKM = [((RPKM1 + RPKM2)/2) * 1000] / max((RPKM1 + RPKM2)/2)
does not make any sense to me.
As an example, the average RPKM value of 0.126744 on the first line of this file, A549_cell_longPolyA_CSHL_GeneGencV7.CSHL_LID8963-021WC-b1.LID8964-022WC-b2.gff, cannot be derived from RPKM1 = 1.40735 and RPKM2 = 1.61345, whose plain average is about 1.51.
The reason I want to use this formula is that I want to combine these values with the ones derived from the Caltech experiments, as the CSHL ones are incomplete.
I would appreciate any help with this.
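The word "normalized" is important here. As I read the formula, each gene's average RPKM is divided by the largest average RPKM in the file and then multiplied by 1000. So if you have at least one really big RPKM, then everything else will be really small.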
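To make that concrete, here is a minimal Python sketch of how I read the formula. It assumes the max is taken over all genes in the file, which the track description does not spell out. The function names are mine, and every toy value except the first gene's two RPKMs is made up for illustration.

```python
def normalized_scores(rpkm1, rpkm2, scale=1000.0):
    """Average the two replicates per gene, then rescale so that the
    gene with the largest average gets a score equal to `scale`."""
    means = [(a + b) / 2.0 for a, b in zip(rpkm1, rpkm2)]
    max_mean = max(means)
    if max_mean == 0:
        # Nothing is expressed, so avoid dividing by zero.
        return [0.0 for _ in means]
    return [m * scale / max_mean for m in means]


def implied_max_mean(rpkm1, rpkm2, score, scale=1000.0):
    """Invert the formula for a single gene: which maximum average RPKM
    would produce the reported score?"""
    return ((rpkm1 + rpkm2) / 2.0) * scale / score


# The first gene uses the values from the question; the other two are
# hypothetical and only show the effect of one very large RPKM.
rep1 = [1.40735, 12000.0, 50.0]
rep2 = [1.61345, 11800.0, 60.0]
scores = normalized_scores(rep1, rep2)
print(scores)

# Maximum average RPKM implied by the reported score of 0.126744,
# assuming the formula is applied exactly as written.
print(implied_max_mean(1.40735, 1.61345, 0.126744))
```

If this reading is right, a score of 0.126744 would mean that the most highly expressed gene in that file has an average RPKM on the order of 10^4, which you could check directly against the file.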
BTW, isn't the raw data available? It might be easiest to just realign things and compute the RPKMs yourself over the same genesets (then you can avoid this rather odd normalization method).
Yes, that is true. I will do that if nothing comes up. I also sent an e-mail to the curator, but I got no reply from her. Even if I run Cufflinks and come up with my own RPKMs, though, the way the values from the 2 replicates are combined is still strange... Thank you for the suggestion. I will wait a bit longer. :)