Question

Normalized Average RPKM Formula

10.2 years ago

Dima1982 ▴ 20

Hi all, this is my first question.

I am using the Long RNA-seq from ENCODE/Cold Spring Harbor Lab gene expression data and the normalized average RPKM formula provided here:

score = normalized average RPKM = [((RPKM1 + RPKM2)/2) * 1000] / max((RPKM1 + RPKM2)/2)

does not make any sense to me.

As an example, an average RPKM value of 0.126744 from the first line of this file: A549_cell_longPolyA_CSHL_GeneGencV7.CSHL_LID8963-021WC-b1.LID8964-022WC-b2.gff cannot be derived from RPKM1 = 1.40735 and RPKM2 = 1.61345.

I want to use this formula so that I can combine these values with those derived from the Caltech experiments, as the CSHL ones are incomplete.

I would appreciate any help with this.

rpkm • 3.4k views


The word "normalized" is important here. If you have at least one really big RPKM then everything else will be really small.
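To illustrate, here is a minimal sketch. It assumes the max in the formula is taken over the average RPKMs of all genes in the file, which the formula does not state explicitly. The RPKM pairs are made up and `normalized_scores` is just an illustrative helper name:

```python
def normalized_scores(avg_rpkms):
    """Scale average RPKMs so that the largest one maps to 1000."""
    top = max(avg_rpkms)
    if top <= 0:
        raise ValueError("need at least one positive average RPKM")
    return [a * 1000 / top for a in avg_rpkms]


# Hypothetical (RPKM1, RPKM2) replicate pairs, NOT taken from the ENCODE file.
pairs = [(1.4, 1.6), (20.0, 30.0), (9000.0, 11000.0)]

# Plain per-gene average of the two replicates.
averages = [(r1 + r2) / 2 for r1, r2 in pairs]

# Assumption: the maximum is taken across all genes in the file.
scores = normalized_scores(averages)

for avg, score in zip(averages, scores):
    print(f"average RPKM {avg:10.2f} -> score {score:.3f}")
```

With one very highly expressed gene in the set, the gene averaging 1.5 RPKM ends up with a score well below 1, even though the most expressed gene gets 1000.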

BTW, isn't the raw data available? It might be easiest to just realign things and compute the RPKMs yourself over the same genesets (then you can avoid this rather odd normalization method).

10.2 years ago by Devon Ryan 104k

Yes, that is true. I will do that if nothing comes up. I also sent an e-mail to the curator, but I have had no reply from her. Even if I run Cufflinks and come up with my own RPKMs, though, the way the values from the 2 replicates are combined is still strange... Thank you for the suggestion. I will wait a bit longer. :)

10.2 years ago by Dima1982 ▴ 20

score 1 · Answer 1 · 2014-02-11

The curator replied!

The answer:

"Originally these files were formatted to be displayed in the UCSC browser as gene models similar to those for the Gencode or Refseq annotations, i.e. boxes for exons, connective lines for introns. The expression was supposed to be represented with grey-shading of the 'boxes' - the more expressed the gene/transcript, the darker the coloring.

In the .bed format which was developed to do that - the 'score' field is reserved to indicate the color value, for which UCSC requires values between 0 and 1. In the RNAseq file we have applied a normalization procedure to adjust the RPKMs to a value between 0 and 1.

However, the real average expression for a gene with two RNAseq replicates for the same sample would be (RPKM1+RPKM2) / 2."
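As a quick check with the numbers from my question, here is a small sketch. The plain average comes straight from the curator's formula. The "implied maximum" is only a back-calculation, assuming the score was computed as average * 1000 / max with the max taken over all genes in the file. I have not verified that against the file itself.

```python
def average_rpkm(rpkm1, rpkm2):
    """Plain average expression of a gene across two replicates."""
    return (rpkm1 + rpkm2) / 2


# Values from the first line of the GFF file quoted in the question.
rpkm1, rpkm2 = 1.40735, 1.61345
score = 0.126744

avg = average_rpkm(rpkm1, rpkm2)

# Assumption: score = avg * 1000 / max_avg, so max_avg = avg * 1000 / score.
implied_max = avg * 1000 / score

print(f"average RPKM: {avg:.4f}")
print(f"implied maximum average RPKM (under the assumption above): {implied_max:.0f}")
```

So for combining experiments, the score column can be ignored and the two RPKM values averaged directly.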

I guess this clears up the matter.