Calculate the Pearson correlation and associated p value for multiple variables
2.9 years ago
Cp.Recker • 0

Hi all,

I am working with a large microarray expression data set. I have expression values for 27000 probes representing 5500 genes across 14 data points (variables D1 to D14). Some of these 5500 genes are represented by multiple probes (i.e., different probes for the same gene); the number of probes per gene ranges from 1 to 5. I want to compute the pairwise Pearson correlation coefficient and the associated p-value for every possible combination of probes belonging to the same gene across the 14 data points, and export the result in a flat, 1-dimensional format (one row per probe pair). A small portion of my input data table in CSV format is shown below:

ProbeName Gene D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14
A1 A 9.1 6.6 8.2 9.3 9.0 8.8 9.9 7.5 10.8 9.0 8.3 11.6 9.3 10.9
A2 A 3.9 3.7 5.8 2.2 2.9 2.8 2.9 3.8 3.3 1.7 3.2 3.5 5.9 3.7
A3 A 4.6 4.8 6.8 2.8 4.3 3.5 4.2 5.3 4.5 3.3 4.0 4.3 6.9 4.7
A4 A 3.8 3.9 5.8 3.2 4.0 2.8 3.7 4.6 3.6 2.2 3.8 4.3 5.6 3.9
A5 A 6.3 6.6 7.7 5.9 5.9 5.6 6.2 6.4 5.8 4.9 5.4 6.1 7.7 6.9
B1 B 7.5 5.5 7.1 10.2 7.2 8.6 8.3 7.1 6.1 7.0 9.2 6.4 6.4 9.4
B2 B 4.6 4.8 5.6 4.3 4.7 4.3 4.0 5.5 4.0 3.3 3.8 5.0 5.7 4.7
B3 B 5.1 3.9 5.1 6.5 5.0 5.4 4.9 5.3 4.5 4.5 5.9 5.0 4.6 5.6
B4 B 7.6 6.1 7.5 10.9 8.0 9.2 8.5 7.1 6.3 7.4 10.0 6.9 6.9 10.2
C1 C 3.1 6.1 3.4 2.5 3.7 3.3 2.7 5.0 2.3 3.1 2.0 3.8 2.6 3.3
C2 C 3.8 7.1 4.8 4.1 4.9 4.5 3.8 5.9 4.0 4.7 4.4 5.1 2.9 4.8
C3 C 3.8 6.1 5.5 5.4 6.3 3.9 3.4 7.8 5.3 5.7 4.8 4.0 3.5 4.3
D1 D 12.2 11.7 11.4 10.5 11.5 11.4 10.7 12.0 11.3 10.5 9.9 11.7 10.5 10.2
D2 D 12.0 11.5 11.3 10.4 11.4 11.4 10.7 11.9 11.2 10.6 9.9 11.7 10.3 10.2
E1 E 2.4 3.3 7.5 3.4 5.8 3.6 1.2 3.5 0.9 2.2 3.1 4.7 7.5 4.0

The ProbeName column holds the probe names (A1 to E1), the Gene column holds the gene names (A to E), and columns D1 to D14 (the variables) hold the expression values at the different data points. Each row gives the expression values of one probe for a particular gene across the 14 data points (i.e., how strongly that gene is expressed at each data point, as measured by that probe). A1, A2, A3, A4, and A5 are multiple probes for the same gene A, and likewise for genes B, C, D, and E. In this table, I want to compute every possible pairwise Pearson correlation between probes of the same gene across the 14 data points (D1 to D14). For example, the possible probe combinations for gene C are:

1. C1 (D1:3.1, D2:6.1, D3:3.4, D4:2.5, D5:3.7, D6:3.3, D7:2.7, D8:5.0, D9:2.3, D10:3.1, D11:2.0, D12:3.8, D13:2.6, D14:3.3) vs
   C2 (D1:3.8, D2:7.1, D3:4.8, D4:4.1, D5:4.9, D6:4.5, D7:3.8, D8:5.9, D9:4.0, D10:4.7, D11:4.4, D12:5.1, D13:2.9, D14:4.8)
2. C1 (D1:3.1, D2:6.1, D3:3.4, D4:2.5, D5:3.7, D6:3.3, D7:2.7, D8:5.0, D9:2.3, D10:3.1, D11:2.0, D12:3.8, D13:2.6, D14:3.3) vs
   C3 (D1:3.8, D2:6.1, D3:5.5, D4:5.4, D5:6.3, D6:3.9, D7:3.4, D8:7.8, D9:5.3, D10:5.7, D11:4.8, D12:4.0, D13:3.5, D14:4.3)
3. C2 (D1:3.8, D2:7.1, D3:4.8, D4:4.1, D5:4.9, D6:4.5, D7:3.8, D8:5.9, D9:4.0, D10:4.7, D11:4.4, D12:5.1, D13:2.9, D14:4.8) vs
   C3 (D1:3.8, D2:6.1, D3:5.5, D4:5.4, D5:6.3, D6:3.9, D7:3.4, D8:7.8, D9:5.3, D10:5.7, D11:4.8, D12:4.0, D13:3.5, D14:4.3)
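
For illustration, the within-gene pairing listed above could be enumerated in R with `combn()`. This is only a minimal sketch of the pairing step, not a full solution; the toy `probes` table and the helper name `pairs_within_gene` are assumptions made up for this example.

```r
# Toy subset of the probe/gene columns from the table above.
probes <- data.frame(
  ProbeName = c("C1", "C2", "C3", "D1", "D2", "E1"),
  Gene      = c("C", "C", "C", "D", "D", "E"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all unordered probe pairs within one gene.
pairs_within_gene <- function(probe_names) {
  if (length(probe_names) < 2) return(NULL)   # single-probe genes yield no pair
  p <- t(combn(probe_names, 2))               # one row per unordered pair
  data.frame(ProbeName_1 = p[, 1], ProbeName_2 = p[, 2],
             stringsAsFactors = FALSE)
}

# Split probe names by gene so that pairs never cross gene boundaries.
by_gene <- split(probes$ProbeName, probes$Gene)
pair_list <- lapply(names(by_gene), function(g) {
  out <- pairs_within_gene(by_gene[[g]])
  if (!is.null(out)) out$Gene <- g
  out
})
pairs <- do.call(rbind, pair_list)            # NULL entries are dropped by rbind
print(pairs)
```

Because `combn()` returns each unordered pair once, self-pairs such as C1 vs C1 never appear, and gene E (a single probe) contributes no rows.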


After generating the correlation matrix of all pairwise combinations of probes for the same gene across the 14 data points, I want to flatten only the upper (or lower) triangle of the correlation matrix and write the output in CSV format, as shown below:

ProbeName_1 ProbeName_2 Gene PearsonCorrelationValue Pvalue
A1 A2 A -0.129 0.661
A1 A3 A -0.176 0.547
A1 A4 A -0.106 0.718
A1 A5 A -0.084 0.776
A2 A3 A 0.963 0.000
A2 A4 A 0.932 0.000
A2 A5 A 0.914 0.000
A3 A4 A 0.922 0.000
A3 A5 A 0.883 0.000
A4 A5 A 0.882 0.000
B1 B2 B -0.328 0.253
B1 B3 B 0.900 0.000
B1 B4 B 0.987 0.000
B2 B3 B -0.084 0.774
B2 B4 B -0.322 0.261
B3 B4 B 0.882 0.000
C1 C2 C 0.888 0.000
C1 C3 C 0.542 0.045
C2 C3 C 0.658 0.011
D1 D2 D 0.993 0.000

I do not know how to handle this kind of data in R. I would be grateful if the experts here could help me with this problem.

Note: I do not want correlation values for identical probe combinations, i.e., A1 vs A1, A2 vs A2, A3 vs A3, A4 vs A4, or A5 vs A5. I also do not want pairwise combinations between a probe of one gene and a probe of a different gene, i.e., A1 vs B1, B2, B3, B4; A1 vs C1, C2, C3; A1 vs D1, D2; or A1 vs E1.
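
Putting these requirements together, one possible approach in base R is to split the table by gene, run `cor.test()` on every unordered probe pair within each gene, and stack the results into one row per pair. The sketch below makes several assumptions: only genes C, D, and E from the sample table are typed in, and the names `expr`, `value_cols`, `cor_within_gene`, and the output file name are hypothetical.

```r
# Subset of the input table (genes C, D and E), typed in for illustration.
expr <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ProbeName Gene D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14
C1 C 3.1 6.1 3.4 2.5 3.7 3.3 2.7 5.0 2.3 3.1 2.0 3.8 2.6 3.3
C2 C 3.8 7.1 4.8 4.1 4.9 4.5 3.8 5.9 4.0 4.7 4.4 5.1 2.9 4.8
C3 C 3.8 6.1 5.5 5.4 6.3 3.9 3.4 7.8 5.3 5.7 4.8 4.0 3.5 4.3
D1 D 12.2 11.7 11.4 10.5 11.5 11.4 10.7 12.0 11.3 10.5 9.9 11.7 10.5 10.2
D2 D 12.0 11.5 11.3 10.4 11.4 11.4 10.7 11.9 11.2 10.6 9.9 11.7 10.3 10.2
E1 E 2.4 3.3 7.5 3.4 5.8 3.6 1.2 3.5 0.9 2.2 3.1 4.7 7.5 4.0
")

value_cols <- paste0("D", 1:14)   # the 14 expression columns

# Hypothetical helper: correlate every probe pair within one gene's rows.
cor_within_gene <- function(g) {
  if (nrow(g) < 2) return(NULL)   # nothing to pair (e.g. gene E)
  idx <- combn(nrow(g), 2)        # each column is one unordered row pair
  do.call(rbind, lapply(seq_len(ncol(idx)), function(k) {
    i <- idx[1, k]
    j <- idx[2, k]
    ct <- cor.test(unlist(g[i, value_cols], use.names = FALSE),
                   unlist(g[j, value_cols], use.names = FALSE),
                   method = "pearson")
    data.frame(ProbeName_1 = g$ProbeName[i],
               ProbeName_2 = g$ProbeName[j],
               Gene = g$Gene[i],
               PearsonCorrelationValue = unname(ct$estimate),
               Pvalue = ct$p.value,
               stringsAsFactors = FALSE)
  }))
}

# Splitting by gene guarantees that no cross-gene pairs are formed.
result <- do.call(rbind, lapply(split(expr, expr$Gene), cor_within_gene))
rownames(result) <- NULL
result$PearsonCorrelationValue <- round(result$PearsonCorrelationValue, 3)
result$Pvalue <- round(result$Pvalue, 3)
print(result)

# To export (file name is a placeholder):
# write.csv(result, "probe_correlations.csv", row.names = FALSE)
```

Since each unordered pair is visited exactly once, the result corresponds to one triangle of each gene's correlation matrix without the diagonal. For the full data set, the same code should work after replacing the inline table with `read.csv()` on the real file, assuming it has the same column layout.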

Pearson correlation R • 516 views