Normalising Tag Count to RPKM
10.9 years ago
Dataminer ★ 2.8k

Hi!

I was wondering if there is a way to normalise the number of reads in a region and the RPKM of the gene nearest to that region, so that a correlation can be computed.

For example, the following data show the number of tags in the first column and the RPKM in the second column:

Tags    RPKM
15      0.14619
11      0
203     0.2259
129     10.701
300     7.0772
122     2.3234
346     10.666
77      3.117
201     16.749

What is the best way to normalise or scale the data in the two columns so that a correlation can be computed?

Thank you

rna-seq chip-seq • 2.6k views

Wouldn't that just be computing the RPKM for your region?


The aim is to correlate occupancy with RPKM.

10.9 years ago
Michael 54k

Calculating a correlation doesn't require scaling. Use Spearman's rank correlation if the data are possibly not normally distributed.
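To make this concrete, here is a minimal sketch in plain Python that computes Spearman's rank correlation on the nine Tags/RPKM pairs from the question. The helper names (`average_ranks`, `pearson`, `spearman`) are my own, not from any library, and the rescaling in the last step (dividing tags by 1000, taking log2(RPKM + 1)) is only an illustrative assumption to show that monotone rescaling leaves the rank correlation unchanged.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Data copied from the question.
tags = [15, 11, 203, 129, 300, 122, 346, 77, 201]
rpkm = [0.14619, 0, 0.2259, 10.701, 7.0772, 2.3234, 10.666, 3.117, 16.749]

rho_raw = spearman(tags, rpkm)
# Illustrative monotone rescaling: the ranks, and hence rho, do not change.
rho_rescaled = spearman([t / 1000 for t in tags], [math.log2(r + 1) for r in rpkm])
print(rho_raw, rho_rescaled)
```

Both printed values are the same, because Spearman's rho only looks at the ordering of the values within each column.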


But if the two need to be plotted against each other, they should be scaled somewhat (and, I believe, for the correlation as well), since these are absolute values.


Believe it or not, correlation is scale invariant ;) To try it out, it is easiest to start with R. It has the functions cor and scale; scale centres the columns of a matrix to mean 0 and scales them to unit variance. That is possibly useful for plotting and classification.
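The same experiment can be sketched outside R. The Python below is an illustration only: `pearson` and `scale` are hand-written helpers, with `scale` assumed to mimic the default behaviour of R's scale (centre each column to mean 0, divide by the sample standard deviation). It checks that the Pearson correlation of the question's data is the same before and after scaling.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def scale(column):
    """Centre to mean 0 and divide by the sample standard deviation (n - 1)."""
    centre = statistics.mean(column)
    spread = statistics.stdev(column)
    return [(value - centre) / spread for value in column]

# Data copied from the question.
tags = [15, 11, 203, 129, 300, 122, 346, 77, 201]
rpkm = [0.14619, 0, 0.2259, 10.701, 7.0772, 2.3234, 10.666, 3.117, 16.749]

r_raw = pearson(tags, rpkm)
r_scaled = pearson(scale(tags), scale(rpkm))
print(r_raw, r_scaled)  # the two values agree up to floating-point rounding
```

Scaling changes the axes of a scatter plot, which is why it can help for plotting, but it does not change the correlation coefficient.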


Michael: I believe you. The problem I am having is showing that the number of tags is somehow correlated with RPKM. Yes, Spearman rank correlation works, but I don't get the desired result. That is always the case in science ;) and in life.


It is not surprising that the correlation is not perfect; that follows from the way RPKM is calculated. Only two variables influence the correlation: read count and exon length. The number of bases sequenced is constant for each sample, so exon length is the additional source of variance. That is also why some people think that RPKM introduces another gene-length-dependent bias instead of resolving bias.
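A small sketch of why exon length matters, assuming the usual definition of RPKM (reads per kilobase of exon per million mapped reads). The function name `rpkm_value` and all numbers below are hypothetical and chosen only for illustration; they are not taken from the question's data.

```python
def rpkm_value(read_count, exon_length_bp, total_mapped_reads):
    """RPKM = reads * 1e9 / (exon length in bp * total mapped reads)."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)

# Hypothetical sample with 20 million mapped reads (constant within the sample).
total_mapped_reads = 20_000_000

# Two hypothetical genes with the same read count but different exon lengths.
short_gene = rpkm_value(100, 1_000, total_mapped_reads)
long_gene = rpkm_value(100, 4_000, total_mapped_reads)
print(short_gene, long_gene)
```

With the sample total held fixed, the same read count gives a fourfold lower RPKM for the gene that is four times longer, so raw counts and RPKM cannot line up perfectly unless all exon lengths are equal.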
