Correlating Correlations: Conceptual statistics questions behind correlating multi-dimensional data
4.3 years ago
sako242 ▴ 20

Hi all,

I have RNA-seq and proteomics datasets for patients along a disease spectrum, plus control samples (n = 50+). My lab's hypothesis is that as the disease worsens, the RNA:protein ratio for certain genes grows larger and larger. That is, RNA goes up while protein levels for the same gene go down. My lab has limited statistical experience in bioinformatics, so I'd really appreciate any help or tips. Briefly, my general strategy to test this has been:

  • load RNAseq and proteomics datasets
  • individually log2 transform, batch normalize, z-score
  • limit to genes found in every patient
  • correlate RNA:protein ratio per gene across disease conditions as desired
  • for continuous disease variables, correlate those correlations with disease progression; for discrete groups, compare them between groups
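
To make the steps above concrete, here is a minimal sketch in plain Python. It rests on several assumptions that are not part of my actual pipeline: the function names are made up for illustration, a pseudocount of 1 is added before the log2 transform, batch normalization is left out entirely, and "correlation per gene" is taken to mean a Pearson correlation between RNA and protein across the patients of one group.

```python
import math

def log2_zscore(values, pseudocount=1.0):
    """log2-transform one gene's values, then z-score them across samples.

    The pseudocount is an illustrative assumption to avoid log2(0).
    """
    logged = [math.log2(v + pseudocount) for v in values]
    mean = sum(logged) / len(logged)
    sd = math.sqrt(sum((x - mean) ** 2 for x in logged) / (len(logged) - 1))
    return [(x - mean) / sd for x in logged]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def per_gene_correlations(rna, protein):
    """Correlate RNA with protein for every gene present in both datasets.

    rna, protein: dict mapping gene -> list of values, one per patient,
    with patients in the same order in both dicts (hypothetical layout).
    """
    shared = sorted(set(rna) & set(protein))
    return {
        gene: pearson_r(log2_zscore(rna[gene]), log2_zscore(protein[gene]))
        for gene in shared
    }
```

Running `per_gene_correlations` once on the diseased patients and once on the controls would give the two per-gene lists of r values that I describe below.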

This results in ~2300 genes that are common to both datasets and detected in every patient sample, a number limited largely by proteomic depth. However, I'm getting lost in making sure I'm doing the right statistics.

For example, when I compute my RNA:protein correlations, I'm doing 2300 correlations for every patient (hundreds total). Do these correlations' p-values matter for my question? Do I adjust them? What should my cutoff be? My p-value histograms show some strong effects both before and after FDR adjustment, which is promising.
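
For reference, this is roughly what I mean by FDR adjustment. The sketch below is a plain-Python version of the Benjamini-Hochberg procedure; the function name is hypothetical, and assuming Benjamini-Hochberg (rather than some other FDR method) is my own simplification for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With ~2300 genes, `benjamini_hochberg` would be applied once to the full vector of per-gene p-values, and a gene would be called significant if its adjusted value falls below the chosen FDR level.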

Ultimately, I need to compare the RNA:protein correlation (r value) between disease progression groups. In the simplest example, I have "diseased" vs "control" groups. I can consolidate the data into lists of per-gene RNA:protein correlation coefficients for my diseased and control groups. Do I then do an unpaired t-test per gene and find which correlation values differ significantly? Should I limit the analysis to genes whose correlations are themselves significant, or can I get away with assessing every r value (regardless of the probability that the coefficient arose by chance) and let my final adjusted t-test be my statistical barrier?

Thanks for any help with this.

RNA-Seq proteomics correlation • 931 views
4.3 years ago

You can test for the difference in correlation coefficients between control and disease. Note, however, that testing for a difference between two correlations means applying Fisher's r-to-z transformation to both coefficients and then doing a z-test on the transformed values, rather than running an ordinary t-test on the raw r values. This is implemented in the r.test() function in the R package psych.
I think it should also be possible, and maybe preferable, to use a regression model to quantify how the RNA/protein ratio predicts the disease level.
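
To show what the Fisher-transformation comparison amounts to, here is a minimal plain-Python sketch for two independent correlations (for example, one gene's r in the disease group versus its r in the control group). The function name and argument names are made up for illustration, and this is a sketch of the standard textbook test, not a reimplementation of psych's r.test().

```python
import math

def fisher_z_difference(r1, n1, r2, n2):
    """Two-sided z-test for the difference between two independent correlations.

    r1, r2: correlation coefficients from the two groups.
    n1, n2: number of samples each correlation was computed from (each > 3).
    Returns (z statistic, two-sided p-value).
    """
    # Fisher's r-to-z transformation is the inverse hyperbolic tangent.
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    # Standard error of the difference between the transformed values.
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

Applied per gene, this gives one p-value per gene, and those p-values would still need a multiple-comparison correction across all genes tested.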


The psych package was exactly what I was looking for. Thanks for your help. As it turned out, not many genes passed the significance threshold once I corrected for multiple comparisons.
