Question

Correlating Correlations: Conceptual statistics questions behind correlating multi-dimensional data

1

5.6 years ago

sako242 ▴ 20

Hi all,

I have an RNAseq and a proteomics dataset for patients along some disease spectrum vs control samples (n = 50+). My lab has a theory that as the disease worsens, the RNA:protein ratio for certain genes becomes larger and larger. That is, RNA goes up, but protein levels for the same gene decrease. My lab is somewhat limited in statistical understanding for bioinformatics, so I'd really appreciate any help or tips. Briefly, my general strategy to test this has been:

1. Load the RNAseq and proteomics datasets.
2. Individually log2 transform, batch normalize, and z-score each dataset.
3. Limit to genes found in every patient.
4. Correlate RNA:protein ratio per gene across disease conditions as desired.
5. Correlate the correlations as disease progresses for continuous variables; compare them for discrete variables.
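One possible reading of the per-gene correlation step can be sketched as below. This is a minimal illustration, not the pipeline actually used: the nested-dict layout (`gene -> sample -> value`) and the function names `pearson_r` and `per_gene_correlation` are assumptions made for the example, and it computes one Pearson r per gene between RNA and protein levels across the samples of a single group.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length with at least 3 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return sxy / math.sqrt(sxx * syy)


def per_gene_correlation(rna, protein, samples):
    """Return {gene: r} for genes present in both datasets.

    rna, protein: dict mapping gene -> dict mapping sample -> value
                  (hypothetical layout, values already normalized).
    samples:      sample IDs belonging to one group, e.g. all controls.
    """
    shared_genes = sorted(set(rna) & set(protein))
    return {
        gene: pearson_r(
            [rna[gene][s] for s in samples],
            [protein[gene][s] for s in samples],
        )
        for gene in shared_genes
    }
```

Calling `per_gene_correlation` once with the control sample IDs and once with the diseased sample IDs gives two r values per gene, which is the input needed for a per-gene comparison between groups.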

This results in ~2300 genes common to both datasets and found in every patient sample, a number limited mainly by the proteomic depth. However, I'm getting lost in making sure I'm doing the right statistics.

For example, when I compute my RNA:protein correlations, I'm doing 2300 correlations for every patient (hundreds total). Do these correlations' p-values matter for my question? Do I adjust them? What should my cutoff be? My p-value histograms show some strong effects when comparing FDR-adjusted with unadjusted p-values, which is promising.

Ultimately, I need to compare the RNA:protein correlation coefficient (r) between disease progression groups. In the simplest example, I have "diseased" vs "control" groups. I can consolidate the data into lists of per-gene RNA:protein correlation coefficients for my diseased and control groups. Do I then do an unpaired t-test per gene and find which correlation values differ significantly? Should I limit the analysis to genes whose correlations are themselves significant, or can I get away with assessing every r value (regardless of the probability that a coefficient arose by chance) and let my final adjusted t-test be my statistical filter?

Thanks for any help with this.

RNA-Seq proteomics correlation • 1.3k views

score 3 · Answer 1 · 2019-12-18

3

5.6 years ago

Jean-Karim Heriche 27k

You can test for the difference in correlation coefficients between control and disease. Rather than running a t-test directly on the r values, the usual approach for two independent correlations is to apply Fisher's r-to-z transformation to each one and then test the difference between the transformed values with a z-test. This is implemented in the r.test() function in the R package psych.
I think it should also be possible and maybe preferable to use a regression model to quantify how the RNA/protein ratio predicts the disease level.
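For readers who want to see what the Fisher-transformation test involves, here is a minimal Python sketch of the textbook version for two independent correlations. It is an illustration under stated assumptions, not a port of psych's r.test(): the function name `fisher_z_test` is made up for the example, and it assumes each r was computed on a separate set of samples (n1 controls, n2 diseased) with roughly bivariate-normal data.

```python
import math
from statistics import NormalDist


def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test for a difference between two independent Pearson correlations.

    r1, r2: correlation coefficients, each strictly between -1 and 1.
    n1, n2: number of samples each correlation was computed from (> 3).
    Returns (z statistic, two-sided p-value).
    """
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each group needs more than 3 samples")
    # Fisher's r-to-z transformation: z = atanh(r)
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    # Standard error of the difference between the transformed values
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value from the standard normal distribution
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p
```

Applied once per gene, this yields one p-value per gene, so with ~2300 genes the resulting p-values still need a multiple-testing correction.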

1

The psych package was exactly what I was looking for. Thanks for your help. As it turned out, not many genes passed the significance threshold once I corrected for multiple comparisons.
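The thread does not say which multiple-comparison correction was applied. As one common choice, a Benjamini-Hochberg (FDR) adjustment of the per-gene p-values could look like the sketch below; the function name `benjamini_hochberg` is an assumption for the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose adjusted p-value falls below the chosen FDR level (0.05 is a common convention) would be the ones reported as having a different RNA:protein correlation between groups.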

ADD REPLY • link 5.3 years ago by sako242 ▴ 20