RNA Seq correlation with P value
2
1
Entering edit mode
7.5 years ago
peris ▴ 120

I have RNA seq data-set. Bellow is an example. Where 1st column are the genes and 1st row are the samples. In actual dataset I have over 4000 samples. Now, I have two set of genes. Set1 (A1, A2, A3) is my list of genes of my interest and set2 (B-J) with known disease causing genes. Now, I want to calculate Pearson correlations for each of the set2 genes with respect to my gene and want to retain genes P < 0.05. How can I do this?

N    S1    S2    S3    S4    S5    S6    S7    S8    S8    S10    S11    S12    S13    S14
A1    1943    3056    813    812    1769    2324    360    920    830    582    85    123    1306    1651
A2    39    39    7    10    212    360    100    188    202    178    178    206    52    51
A3    593    704    73    61    204    383    201    245    384    278    369    652    523    651
B    104    43    10    17    397    639    158    294    226    217    802    690    49    80
C    73    58    17    13    638    886    316    553    608    573    564    759    134    173
D    102    54    14    17    204    398    452    454    289    271    66    65    122    113
E    2820    3419    1531    1305    1795    2489    1057    1821    1457    1383    292    241    3672    3678
F    34    169    71    67    949    1014    337    1019    747    668    52    53    1936    812
G    85    117    54    58    37    54    99    146    115    96    48    38    920    512
H    133    202    32    48    42    37    50    105    79    80    24    21    433    277
I    142    172    40    43    39    69    114    122    105    117    169    95    344    172
J    84    132    21    36    49    77    120    159    108    144    206    447    302    98

Pearson RNA-Seq R • 2.6k views
0
Entering edit mode

Note that testing the null hypothesis of correlation = 0 is statistically equivalent to testing slope = 0 in the corresponding linear regression model. Since any stats package will give you this P-value as part of its output for fitting a linear model, this might be an easier way to get what you're looking for.

0
Entering edit mode
7.5 years ago

I'd either write a script, (though you probably don't want to hand write the code to calculate the p-value from the raw Pearson output) or if you have it, Spotfire will do oodles of Pearson's if you ask it to. You can do this in Excel too.

0
Entering edit mode

Thanks swbarnes2. I have a perl script which I use for very small sample size. But in this the sample size is >4000. So, I was looking for some other options.

0
Entering edit mode
7.5 years ago

You can use the R function cor.test. See ?cor.test for parameters. I would consider if it makes sense to use another method than pearsons, because it might be doubted that your count data is approximately normally distributed:

From ?cor.test:

"If method is "pearson", the test statistic is based on Pearson's product moment correlation coefficient cor(x, y) and follows a t distribution with length(x)-2 degrees of freedom if the samples follow independent normal distributions.

If method is "kendall" or "spearman", Kendall's tau or Spearman's rho statistic is used to estimate a rank-based measure of association. These tests may be used if the data do not necessarily come from a bivariate normal distribution.

For Spearman's and Kendall's methods, also exact p-values are available, however these might not be computationally feasible for your >4k samples.

The alternative would be to directly generate a co-expression network in Cytoscape using: http://apps.cytoscape.org/apps/expressioncorrelation from all your genes, then you get a visualization option for free.