Question: Correlating Correlations: Conceptual statistics questions behind correlating multi-dimensional data
14 months ago, sako242 wrote:

Hi all,

I have an RNAseq and a proteomics dataset for patients along some disease spectrum vs control samples (n = 50+). My lab has a theory that as the disease worsens, the RNA:protein ratio for certain genes becomes larger and larger. That is, RNA goes up, but protein levels for the same gene decrease. My lab is somewhat limited in statistical understanding for bioinformatics, so I'd really appreciate any help or tips. Briefly, my general strategy to test this has been:

  • load RNAseq and proteomics datasets
  • individually log2 transform, batch normalize, z-score
  • limit to genes found in every patient
  • correlate RNA:protein ratio per gene across disease conditions as desired
  • correlate correlations as disease progresses for continuous variables, compare for discrete
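To make the steps above concrete, here is a minimal sketch of the per-gene part in plain Python. It assumes the "RNA:protein correlation" is a Pearson correlation computed per gene across the patients of one group; the pseudocount, the toy values, and the names `log2_zscore`, `pearson_r`, `GENE_A` and `GENE_B` are all made up for illustration, and batch normalization is left out.

```python
import math
from statistics import mean, pstdev

def log2_zscore(values, pseudocount=1.0):
    """log2-transform one gene's values (pseudocount is an assumption), then z-score them."""
    logged = [math.log2(v + pseudocount) for v in values]
    mu, sd = mean(logged), pstdev(logged)
    return [(x - mu) / sd for x in logged]

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy data: gene -> measurements for the same five patients, in the same order
rna = {"GENE_A": [10, 20, 40, 80, 160], "GENE_B": [5, 9, 7, 30, 12]}
protein = {"GENE_A": [300, 150, 70, 40, 15], "GENE_B": [8, 14, 11, 50, 20]}

# Keep only genes measured in both datasets, then correlate RNA with protein per gene
shared = sorted(set(rna) & set(protein))
r_by_gene = {
    g: pearson_r(log2_zscore(rna[g]), log2_zscore(protein[g])) for g in shared
}
```

In this toy example GENE_A has RNA rising while protein falls, so its r is negative; GENE_B moves together, so its r is positive. Running the same loop separately on the diseased and control patients would give one r per gene per group.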

This leaves ~2300 genes that are common to both datasets and detected in every patient sample, a number limited largely by the proteomic depth. However, I'm getting lost in making sure I'm doing the right statistics.

For example, when I compute my RNA:protein correlations, I'm doing 2300 correlations for every patient (hundreds total). Do these correlations' p-values matter for my question? Do I adjust them? What should my cutoff be? My p-value histograms show some strong effects when comparing FDR-adjusted and non-adjusted p-values, which is promising.
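For reference, the kind of FDR adjustment mentioned here can be sketched as follows. This assumes the Benjamini-Hochberg procedure (the post does not say which FDR method was used), and `benjamini_hochberg` is just an illustrative name.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# One raw p-value per gene (toy numbers)
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
```

With ~2300 genes, the adjustment would be applied once across all per-gene p-values of a given comparison, and a gene would be called significant when its adjusted value falls below the chosen FDR level.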

Ultimately, I need to compare the RNA:protein correlation r value between disease progression groups. In the simplest example, I have "diseased" vs "control" groups. I can consolidate the data into lists of the RNA:protein correlation coefficient r per gene for my diseased and control groups. Do I then do an unpaired t-test per gene and find which correlation values differ significantly? Should I limit this to the genes whose correlations are significant, or can I get away with assessing every r value (regardless of the probability that the coefficient arose by chance) and have my final adjusted t-test be my statistical barrier?

Thanks for any help with this.

14 months ago, Jean-Karim Heriche (EMBL Heidelberg, Germany) wrote:

You can test for the difference in correlation coefficients between control and disease. Rather than running a t-test on the raw coefficients, this involves applying Fisher's r-to-z transformation to the two correlations and then comparing the transformed values with a z-test. This is implemented in the r.test() function in the R package psych.
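As an illustration of what such a test computes, here is a minimal plain-Python sketch of the standard Fisher z comparison of two independent correlations. It is not the psych implementation; the function name and the example numbers are made up, and it assumes the two groups contain different patients.

```python
import math
from statistics import NormalDist

def compare_independent_correlations(r1, n1, r2, n2):
    """Two-sided z-test for the difference between two independent Pearson correlations."""
    # Fisher's r-to-z transformation (atanh) makes the sampling distribution ~normal
    z1, z2 = math.atanh(r1), math.atanh(r2)
    # Standard error of the difference of two Fisher-transformed correlations
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical gene: r = -0.6 in 30 diseased patients, r = 0.4 in 25 controls
z_stat, p_value = compare_independent_correlations(-0.6, 30, 0.4, 25)
```

Run once per gene, this yields one p-value per gene, which would then still need correcting for multiple comparisons.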
I think it should also be possible and maybe preferable to use a regression model to quantify how the RNA/protein ratio predicts the disease level.
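A rough sketch of the regression idea, for a single gene, might look like the following. It assumes disease level is available as a numeric severity score and uses ordinary least squares on the log2 RNA:protein ratio; the data and the name `ols_fit` are invented for illustration. For a plain diseased/control label, logistic regression would be the analogous model.

```python
from statistics import mean

def ols_fit(x, y):
    """Least-squares fit of y = intercept + slope * x; returns slope, intercept, R^2."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R^2 = 1 - residual sum of squares / total sum of squares
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Toy data for one gene: log2(RNA/protein) per patient and a made-up severity score
log_ratio = [-1.2, -0.5, 0.1, 0.8, 1.5, 2.1]
severity = [0, 1, 1, 2, 3, 4]
slope, intercept, r_squared = ols_fit(log_ratio, severity)
```

A positive slope here would mean that a larger RNA:protein ratio goes with a higher disease level, which is the pattern the hypothesis in the question predicts.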


The psych package was exactly what I was looking for. Thanks for your help. As it turned out, not many genes passed the significance threshold once I corrected for multiple comparisons.

Reply written 10 months ago by sako242