RNA Seq correlation with P value
2
1
Entering edit mode
8.7 years ago
peris ▴ 120

I have RNA seq data-set. Bellow is an example. Where 1st column are the genes and 1st row are the samples. In actual dataset I have over 4000 samples. Now, I have two set of genes. Set1 (A1, A2, A3) is my list of genes of my interest and set2 (B-J) with known disease causing genes. Now, I want to calculate Pearson correlations for each of the set2 genes with respect to my gene and want to retain genes P < 0.05. How can I do this?

N    S1    S2    S3    S4    S5    S6    S7    S8    S8    S10    S11    S12    S13    S14
A1    1943    3056    813    812    1769    2324    360    920    830    582    85    123    1306    1651
A2    39    39    7    10    212    360    100    188    202    178    178    206    52    51
A3    593    704    73    61    204    383    201    245    384    278    369    652    523    651
B    104    43    10    17    397    639    158    294    226    217    802    690    49    80
C    73    58    17    13    638    886    316    553    608    573    564    759    134    173
D    102    54    14    17    204    398    452    454    289    271    66    65    122    113
E    2820    3419    1531    1305    1795    2489    1057    1821    1457    1383    292    241    3672    3678
F    34    169    71    67    949    1014    337    1019    747    668    52    53    1936    812
G    85    117    54    58    37    54    99    146    115    96    48    38    920    512
H    133    202    32    48    42    37    50    105    79    80    24    21    433    277
I    142    172    40    43    39    69    114    122    105    117    169    95    344    172
J    84    132    21    36    49    77    120    159    108    144    206    447    302    98
Pearson RNA-Seq R • 2.9k views
ADD COMMENT
0
Entering edit mode

Note that testing the null hypothesis of correlation = 0 is statistically equivalent to testing slope = 0 in the corresponding linear regression model. Since any stats package will give you this P-value as part of its output for fitting a linear model, this might be an easier way to get what you're looking for.

ADD REPLY
0
Entering edit mode
8.7 years ago

I'd either write a script, (though you probably don't want to hand write the code to calculate the p-value from the raw Pearson output) or if you have it, Spotfire will do oodles of Pearson's if you ask it to. You can do this in Excel too.

ADD COMMENT
0
Entering edit mode

Thanks swbarnes2. I have a perl script which I use for very small sample size. But in this the sample size is >4000. So, I was looking for some other options.

ADD REPLY
0
Entering edit mode
8.7 years ago
Michael 54k

You can use the R function cor.test. See ?cor.test for parameters. I would consider if it makes sense to use another method than pearsons, because it might be doubted that your count data is approximately normally distributed:

From ?cor.test:

"If method is "pearson", the test statistic is based on Pearson's product moment correlation coefficient cor(x, y) and follows a t distribution with length(x)-2 degrees of freedom if the samples follow independent normal distributions.

If method is "kendall" or "spearman", Kendall's tau or Spearman's rho statistic is used to estimate a rank-based measure of association. These tests may be used if the data do not necessarily come from a bivariate normal distribution.

For Spearman's and Kendall's methods, also exact p-values are available, however these might not be computationally feasible for your >4k samples.

The alternative would be to directly generate a co-expression network in Cytoscape using: http://apps.cytoscape.org/apps/expressioncorrelation from all your genes, then you get a visualization option for free.

ADD COMMENT

Login before adding your answer.

Traffic: 1892 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6