Question

Normalising Tag Count To RPKM

0

10.9 years ago

Dataminer ★ 2.8k

Hi!

I was wondering if there is a way to normalise the number of reads in a region and the RPKM of the nearest gene to that region, so that a correlation can be computed.

For example, the following data show the number of tags in the first column and the RPKM in the second column:

Tags    RPKM
15      0.14619
11      0
203     0.2259
129     10.701
300     7.0772
122     2.3234
346     10.666
77      3.117
201     16.749

What is the most effective way to normalise/scale the data in the two columns so that a correlation can be computed?

Thank you

rna-seq chip-seq • 2.6k views

updated 10.9 years ago by Michael 54k • written 10.9 years ago by Dataminer ★ 2.8k

0

Wouldn't that just be computing the RPKM for your region?

10.9 years ago by Istvan Albert 100k

0

The aim is to correlate occupancy with RPKM.

10.9 years ago by Dataminer ★ 2.8k

score 2 · Answer 1 · 2013-06-17

2

10.9 years ago

Michael 54k

Calculating a correlation doesn't require scaling; use Spearman's rank correlation if the data are possibly not normally distributed.
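
As a rough illustration of the point above, here is a minimal sketch that computes Spearman's rank correlation directly on the raw, unscaled values from the question. It is written in plain Python using only the standard library, which is an assumption of this sketch; the helper names `pearson`, `average_ranks` and `spearman` are hypothetical and are not taken from any library.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Data copied from the question, used as-is (no normalisation).
tags = [15, 11, 203, 129, 300, 122, 346, 77, 201]
rpkm = [0.14619, 0, 0.2259, 10.701, 7.0772, 2.3234, 10.666, 3.117, 16.749]

print(round(spearman(tags, rpkm), 3))  # prints 0.617
```

Because only the ranks enter the calculation, any monotonically increasing transformation of either column (log, scaling, shifting) would leave this value unchanged.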

10.9 years ago by Michael 54k

0

But if the two need to be plotted against each other, then they should be somewhat scaled (and, I believe, also for the correlation), as these are absolute values.

10.9 years ago by Dataminer ★ 2.8k

1

Believe it or not, correlation is scale invariant ;) To try it out, it is easiest to start with R. It has the functions cor and scale. scale centres the columns of a matrix to mean 0 and scales them to unit variance. That is possibly useful for plotting and classification.
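
The scale invariance can be checked numerically. The sketch below is a plain-Python stand-in for the R functions mentioned above, not those functions themselves: `pearson` and `zscore` are hypothetical helpers, and `zscore` is assumed to mimic a centre-and-scale step using the sample standard deviation. It compares the Pearson correlation of the question's raw data with the correlation after standardising both columns and after merely changing units.

```python
from math import sqrt
from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def zscore(values):
    """Centre to mean 0 and scale to unit (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Data copied from the question.
tags = [15, 11, 203, 129, 300, 122, 346, 77, 201]
rpkm = [0.14619, 0, 0.2259, 10.701, 7.0772, 2.3234, 10.666, 3.117, 16.749]

r_raw = pearson(tags, rpkm)
r_scaled = pearson(zscore(tags), zscore(rpkm))       # both columns standardised
r_rescaled = pearson([t / 1000 for t in tags], rpkm)  # tags in different units

# All three agree up to floating-point rounding.
print(abs(r_raw - r_scaled) < 1e-9, abs(r_raw - r_rescaled) < 1e-9)  # prints True True
```

Standardising therefore changes how a scatter plot's axes look, but not the correlation coefficient itself.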

10.9 years ago by Michael 54k

0

Michael: I believe you. The problem I am having is to show that the number of tags is somehow correlated with RPKM. Yes, Spearman rank correlation works, but I don't get the desired result; then again, that is always the case in science ;) and in life.

10.9 years ago by Dataminer ★ 2.8k

1

It is not surprising that the correlation is not perfect; that comes from the way RPKM is calculated. There are only two variables that influence the correlation: read count and exon length; the number of bases sequenced is constant for each sample. So exon length is the additional source of variance. That is also why some people think that RPKM introduces another gene-length-dependent bias instead of resolving bias.
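
To make the role of exon length concrete, here is a small sketch using the commonly cited RPKM definition (reads per kilobase of exon per million mapped reads). The function name `rpkm` and all the numbers in the example are hypothetical, chosen only for illustration; they are not taken from the data in this thread.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads.

    RPKM = read_count * 1e9 / (exon_length_bp * total_mapped_reads)
    """
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)

# Hypothetical sample with 20 million mapped reads (constant within a sample).
total_reads = 20_000_000

# Two hypothetical genes with the same read count but different exon lengths.
short_gene = rpkm(200, 1000, total_reads)
long_gene = rpkm(200, 4000, total_reads)

print(short_gene, long_gene)  # prints 10.0 2.5
```

With the sequencing depth fixed for a sample, identical read counts map to different RPKM values purely because of length, which is the extra source of variance described above.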

10.9 years ago by Michael 54k