I have RNA-seq and proteomics datasets for patients along a disease-severity spectrum plus control samples (n = 50+). My lab's hypothesis is that as the disease worsens, the RNA:protein ratio for certain genes grows larger and larger. That is, RNA goes up, but protein levels for the same gene decrease. My lab's statistical background in bioinformatics is somewhat limited, so I'd really appreciate any help or tips. Briefly, my general strategy to test this has been:
- load RNAseq and proteomics datasets
- individually log2 transform, batch normalize, z-score
- limit to genes found in every patient
- compute RNA:protein correlations per gene across disease conditions as desired
- relate those per-gene correlations to disease progression (correlate against a continuous severity measure, or compare between discrete groups)
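For concreteness, the preprocessing steps above look roughly like this (a simplified sketch in Python; the toy `rna`/`prot` matrices and gene/sample names are placeholders for my real data, and the batch-normalization step is omitted):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
samples = [f"S{i}" for i in range(6)]

# toy genes x samples matrices standing in for the real RNA-seq / proteomics data
rna = pd.DataFrame(rng.lognormal(5, 1, (4, 6)),
                   index=["G1", "G2", "G3", "G4"], columns=samples)
prot = pd.DataFrame(rng.lognormal(5, 1, (3, 6)),
                    index=["G2", "G3", "G4"], columns=samples)

def preprocess(df):
    """log2-transform, then z-score each gene (row) across samples."""
    logged = np.log2(df + 1)  # +1 pseudocount to avoid log2(0)
    return logged.sub(logged.mean(axis=1), axis=0).div(logged.std(axis=1), axis=0)

rna_z, prot_z = preprocess(rna), preprocess(prot)

# restrict to genes quantified in both datasets (in every sample)
common = rna_z.index.intersection(prot_z.index)
rna_z, prot_z = rna_z.loc[common], prot_z.loc[common]

# per-sample "ratio"-like quantity: since the data are already log-scaled,
# the RNA:protein ratio becomes a difference of z-scores
ratio = rna_z - prot_z
```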
This leaves ~2,300 genes common to both datasets and quantified in every patient sample, a number largely limited by proteomic depth. However, I'm getting lost in making sure I'm doing the right statistics.
For example, when I compute my RNA:protein correlations, I'm running 2,300 correlations for every patient (hundreds of comparisons in total). Do these correlations' p-values matter for my question? Do I adjust them? What should my cutoff be? My p-value histograms reveal some strong effects when comparing FDR-adjusted vs unadjusted p-values, which is promising.
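On the multiple-testing point, my current understanding is that with ~2,300 tests per comparison the raw p-values need adjusting before applying any cutoff. A minimal Benjamini-Hochberg sketch, implemented by hand with NumPy (equivalently, `statsmodels.stats.multitest.multipletests(..., method='fdr_bh')`):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values (q-values)."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)                        # indices sorting p ascending
    ranked = p[order] * n / np.arange(1, n + 1)  # p_(i) * n / i
    # enforce monotonicity from the largest rank down, cap at 1
    q = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1)
    out = np.empty(n)
    out[order] = q  # return q-values in the original order
    return out

adj = bh_adjust([0.001, 0.01, 0.03, 0.5])  # -> approx [0.004, 0.02, 0.04, 0.5]
```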
Ultimately, I need to compare the RNA:protein correlation coefficients (r values) between disease-progression groups. In the simplest case, I have "diseased" vs "control" groups, so I can consolidate the data into per-gene lists of RNA:protein r values for each group. Do I then run an unpaired t-test per gene to find which correlations differ significantly? And should I restrict this to genes whose correlations were themselves significant, or can I get away with assessing every r value (regardless of the statistical chance that a coefficient is due to chance) and let my final adjusted t-test be the statistical barrier?
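One alternative I've come across for the per-gene comparison: two correlation coefficients from independent groups are often compared with Fisher's z-transformation rather than a t-test on raw r values, since r is bounded and not normally distributed. A sketch (here `n1` and `n2` are the numbers of samples each correlation was computed over; the example r values are made up):

```python
import numpy as np
from scipy.stats import norm

def compare_correlations(r1, n1, r2, n2):
    """Two-sided test of H0: rho1 == rho2 for two independent samples,
    using Fisher's z-transformation of the correlation coefficients."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)    # Fisher z-transform
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of z1 - z2
    z = (z1 - z2) / se
    p = 2 * norm.sf(abs(z))                    # two-sided p-value
    return z, p

# e.g. r = 0.8 over 30 diseased samples vs r = 0.3 over 25 controls
z, p = compare_correlations(0.8, 30, 0.3, 25)
```

The resulting per-gene p-values would then go through the same FDR adjustment as above.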
Thanks for any help with this.