Normalized Average RPKM Formula
10.2 years ago
Dima1982 ▴ 20

Hi all, this is my first question.

I am using the Long RNA-seq from ENCODE/Cold Spring Harbor Lab gene expression data and the normalized average RPKM formula provided here:

score = normalized average RPKM = [((RPKM1 + RPKM2)/2) * 1000] / max((RPKM1 + RPKM2)/2)

does not make any sense to me.

As an example, an average RPKM value of 0.126744 from the first line of this file: A549_cell_longPolyA_CSHL_GeneGencV7.CSHL_LID8963-021WC-b1.LID8964-022WC-b2.gff cannot be derived from RPKM1 = 1.40735 and RPKM2 = 1.61345.
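For reference, here is a minimal Python sketch of one reading of that formula. It assumes the parentheses balance as max((RPKM1 + RPKM2)/2) and that the maximum is taken over all genes in the file; neither is confirmed by the ENCODE description. The function name `normalized_scores` is hypothetical, and only the first replicate pair comes from the file above; the other two are made up for illustration.

```python
def normalized_scores(replicate_pairs, scale=1000):
    """Rescale per-gene mean RPKMs so the largest mean maps to `scale`.

    Assumption: the max() in the formula runs over every gene in the file.
    """
    means = [(r1 + r2) / 2 for r1, r2 in replicate_pairs]
    if not means:
        return []
    top = max(means)
    if top == 0:
        # No expression anywhere: avoid dividing by zero
        return [0.0 for _ in means]
    return [m * scale / top for m in means]


# First pair is from the question; the other two are hypothetical
pairs = [(1.40735, 1.61345), (250.0, 350.0), (0.0, 0.0)]
scores = normalized_scores(pairs)
for pair, score in zip(pairs, scores):
    print(pair, round(score, 4))
```

Under this reading, a single highly expressed gene pushes every other score toward zero, so the score depends on the whole file and not only on the two replicate values of the gene in question.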

The reason I want to use this formula is that I want to combine these values with the ones derived from the Caltech experiments, as the CSHL ones are incomplete.

I would appreciate any help with this.


The word "normalized" is important here. If you have at least one really big RPKM then everything else will be really small.

BTW, isn't the raw data available? It might be easiest to just realign things and compute the RPKMs yourself over the same genesets (then you can avoid this rather odd normalization method).


Yes, that is true. I will do that if nothing comes up. I also sent an e-mail to the curator, but I got no reply from her. Even if I run Cufflinks and come up with my own RPKMs, though, the way the values from the 2 replicates are combined is still strange... Thank you for the suggestion. I will wait a bit longer. :)

10.2 years ago
Dima1982 ▴ 20

The curator replied!

The answer:

"Originally these files were formatted to be displayed in the UCSC browser as gene models similar to those for the Gencode or Refseq annotations, i.e. boxes for exons, connective lines for introns. The expression was supposed to be represented with grey-shading of the 'boxes' - the more expressed the gene/transcript, the darker the coloring.

In the .bed format which was developed to do that - the 'score' field is reserved to indicate the color value, for which UCSC requires values between 0 and 1. In the RNAseq file we have applied a normalization procedure to adjust the RPKMs to values between 0 and 1.

However, the real average expression for a gene with two RNAseq replicates for the same sample would be (RPKM1+RPKM2) / 2."

I guess this clears up the matter.
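As a quick check, the plain average the curator describes can be computed like this. It is only a sketch: `average_rpkm` is a hypothetical helper name, and the two values are the replicate RPKMs from the first line of the file mentioned in the question.

```python
from statistics import fmean


def average_rpkm(*replicates):
    """Plain arithmetic mean of the replicate RPKMs for one gene."""
    return fmean(replicates)


avg = average_rpkm(1.40735, 1.61345)
print(f"{avg:.4f}")  # prints 1.5104
```

So for that gene the real average expression is about 1.51 RPKM, and the score column is only the rescaled value meant for display.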
