Question: Normalized Average RPKM Formula
6.9 years ago, Dima1982 wrote:

Hi all, this is my first question.

I am using the Long RNA-seq from ENCODE/Cold Spring Harbor Lab gene expression data and the normalized average RPKM formula provided here:

score = normalized average RPKM = [((RPKM1 + RPKM2)/2) * 1000] / max((RPKM1 + RPKM2)/2)

does not make any sense to me.

As an example, an average RPKM value of 0.126744 from the first line of this file: A549_cell_longPolyA_CSHL_GeneGencV7.CSHL_LID8963-021WC-b1.LID8964-022WC-b2.gff cannot be derived from RPKM1 = 1.40735 and RPKM2 = 1.61345.

The reason I want to use this formula is that I want to combine the values derived from the Caltech experiments, as the CSHL ones are incomplete.

I would appreciate any help with this.


The word "normalized" is important here. If you have at least one really big RPKM then everything else will be really small.
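To illustrate the point, here is a minimal Python sketch of the formula from the question, assuming the max is taken over the replicate averages of all genes in the file. The function name `normalized_scores` is hypothetical, and only the first pair of values comes from the question; the other two pairs are invented for illustration.

```python
def normalized_scores(rpkm_pairs):
    """Scale each gene's replicate-average RPKM so the largest average maps to 1000."""
    # Average the two replicates for every gene.
    averages = [(r1 + r2) / 2 for r1, r2 in rpkm_pairs]
    largest = max(averages)
    if largest == 0:
        # Avoid dividing by zero when nothing is expressed.
        return [0.0 for _ in averages]
    return [avg * 1000 / largest for avg in averages]


# First pair is from the question; the other two are made-up values.
pairs = [(1.40735, 1.61345), (10.0, 12.0), (11000.0, 13000.0)]
scores = normalized_scores(pairs)
for pair, score in zip(pairs, scores):
    print(pair, round(score, 6))
```

With one very highly expressed gene in the list, a gene whose replicates average about 1.5 RPKM ends up with a score well below 1.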

BTW, isn't the raw data available? It might be easiest to just realign things and compute the RPKMs yourself over the same genesets (then you can avoid this rather odd normalization method).

written 6.9 years ago by Devon Ryan

Yes, that is true. I will do that if nothing comes up. I also sent an e-mail to the curator, but I have not received a reply from her. Even if I run Cufflinks and come up with my own RPKMs, though, the way the values from the 2 replicates are combined is still strange... Thank you for the suggestion. I will wait a bit longer. :)

written 6.9 years ago by Dima1982
6.9 years ago, Dima1982 wrote:

The curator replied!

The answer:

"Originally these files were formatted to be displayed in the UCSC browser as gene models similar to those for the Gencode or Refseq annotations, i.e. boxes for exons, connective lines for introns. The expression was supposed to be represented with grey-shading of the 'boxes' - the more expressed the gene/transcript, the darker the coloring.

In the .bed format which was developed to do that - the 'score' field is reserved to indicate the color value, for which UCSC requires values between 0 and 1. In the RNAseq file we have applied a normalization procedure to adjust the RPKMs to a value between 0 and 1.

However, the real average expression for a gene with two RNAseq replicates for the same sample would be (RPKM1+RPKM2) / 2."

I guess, this clears the matter.
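For anyone checking the numbers, here is a small Python sketch using the values from my question. It computes the plain average the curator describes and then, assuming 0.126744 is the score produced by the formula above (including the * 1000 factor, which is my assumption given that the curator's message mentions a 0 to 1 range), back-calculates the maximum average RPKM that the file would imply. The variable names are just for illustration.

```python
# Values quoted in the question for the first line of the file.
rpkm1, rpkm2 = 1.40735, 1.61345
reported_score = 0.126744

# The real average expression across the two replicates.
average_rpkm = (rpkm1 + rpkm2) / 2

# Assumption: reported_score = average_rpkm * 1000 / max_average,
# so the implied maximum average can be recovered by rearranging.
implied_max_average = average_rpkm * 1000 / reported_score

print("average RPKM:", round(average_rpkm, 5))
print("implied max average RPKM:", round(implied_max_average, 1))
```

If the scaling in the file differs from the formula (for example, no * 1000 factor), the implied maximum changes accordingly, but the plain average does not.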
