Question

Finding Correlation between methylation and RNA-Seq data and their significance

6.8 years ago

noorpratap.singh ▴ 330

I had a couple of questions.

First - Is there any specific reason to pick Spearman or Pearson correlation? Which one is usually preferred, and why?

Second - To compute the significance of the correlation, a common method is to compute the t-statistic from the correlation value and then compute the p-value from the t-distribution. But my question is: doesn't the underlying quantity need to be normally distributed? Any clarity on the statistics side would be helpful.

correlation RNA-Seq methylation • 1.9k views

updated 6.8 years ago by Charles Warden 8.3k • written 6.8 years ago by noorpratap.singh ▴ 330

score 2 · Answer 1 · 2018-10-02

The Spearman correlation is typically used because it is non-parametric. I'm not sure how much it would matter in this case, but (in a very general sense) you may sometimes prefer a metric that takes the magnitude of the difference (not just the ranking) into consideration. For example, if expression is similar but rankings vary in a way that isn't biologically meaningful (particularly among genes with lower expression), my personal opinion is that the Pearson correlation may be preferable (even if there are theoretical arguments about the normality assumption).
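As a rough illustration of the magnitude-versus-ranking point, here is a minimal Python sketch using only the standard library. The methylation and expression values are made up for illustration, and the helper names (`pearson`, `average_ranks`, `spearman`) are hypothetical rather than taken from any package.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical feature: three methylated/low-expression samples and
# three unmethylated/high-expression samples. Within each group the
# small fluctuations happen to run against the overall trend.
methylation = [95, 96, 97, 3, 4, 5]              # percent methylation
expression = [1.0, 1.1, 1.2, 50.0, 50.5, 51.0]   # arbitrary units

pearson_r = pearson(methylation, expression)
spearman_rho = spearman(methylation, expression)
print(f"Pearson r    = {pearson_r:.3f}")     # close to -1
print(f"Spearman rho = {spearman_rho:.3f}")  # about -0.54
```

In this toy case the within-group rank noise pulls Spearman down to roughly -0.54, while Pearson stays close to -1 because the between-group difference dominates.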

I believe you are describing the strategy for p-value calculation in the R cor.test() function. Again, you are right that it makes certain assumptions (and other strategies may in fact be better in a particular circumstance). However, if you are talking about a per-feature test, each individual feature may be more normally distributed among control samples than the overall set of values for a given sample. If you had 100% methylation with low expression and 0% methylation with high expression (in a 2-group comparison, with good concordance between replicates), then you would also have a strong negative correlation, even though the overall methylation distribution for that feature is bimodal. If you have outlier samples, though, you may want to use some other sort of test/score (although, if you have already filtered for differential methylation, that should help with replicate concordance on one end).
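To make the p-value discussion concrete, the sketch below computes the usual t-statistic for a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, on the same kind of bimodal 2-group feature. It then runs an exact permutation test, which is only one example of an alternative that avoids the normality assumption; it is my own illustration, not something recommended above. The data and helper names are hypothetical, and only the Python standard library is used.

```python
from itertools import permutations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def t_statistic(r, n):
    """t-statistic for a Pearson correlation r from n paired samples."""
    return r * sqrt((n - 2) / (1 - r * r))

def exact_permutation_pvalue(x, y):
    """Two-sided p-value: fraction of all orderings of y whose |r| is
    at least as large as the observed |r|. Only feasible for small n,
    since it enumerates n! orderings."""
    observed = abs(pearson(x, y))
    hits = total = 0
    for shuffled in permutations(y):
        total += 1
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical bimodal feature (2 groups, 3 replicates each).
methylation = [95, 96, 97, 3, 4, 5]
expression = [1.0, 1.1, 1.2, 50.0, 50.5, 51.0]

r = pearson(methylation, expression)
t = t_statistic(r, len(methylation))
p_perm = exact_permutation_pvalue(methylation, expression)
print(f"r = {r:.4f}, t = {t:.1f} on {len(methylation) - 2} df")
print(f"exact permutation p-value = {p_perm:.4f}")
```

With only six samples the permutation p-value cannot go below 1/720, and in a clean 2-group design it is further limited by the many orderings that preserve the group split, so a small sample size caps how significant such a test can look.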

To be clear, it is very possible that something other than a correlation will work better for your particular project. However, my opinion is that you may find the identification of candidates reasonable (even with what you are describing, with assumptions of normality that aren't exactly met, particularly if performing one test for the distribution per sample).

In the case of differential methylation (particularly with BS-Seq data), you may find that a standard statistical test on percent methylation has relatively low power. In that case, differences would be even less significant with a non-parametric test, and perhaps also with certain more complicated tests that use a beta-binomial distribution, or with the glm() logistic regression p-value for percent methylation. This is not necessarily bad (if you want clear differences with good concordance between replicates). However, if you focus on the normality assumption to the extent that you define a methylation distribution that looks very different from your original signal (particularly if it makes some large methylation differences less significant, and small differences near 0% and 100% more significant), then I think it is at least worth seeing what the differences look like with more direct measurements (that is, comparing methods with different strategies for your project). I also think visualizing the percent methylation values is worthwhile, even if you have a p-value calculated with some sort of transformation (or count-based test).
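As a small worked example of how a transformation can reweight differences, the sketch below uses a logit-style M-value, log2(beta / (1 - beta)). This is an assumption on my part about the kind of transformation meant; no specific one is named above. The methylation fractions are made up, and `m_value` is a hypothetical helper name.

```python
from math import log2

def m_value(beta):
    """Logit-style transform of a methylation fraction (0 < beta < 1)."""
    return log2(beta / (1 - beta))

# Hypothetical group means, as fractions rather than percentages.
near_zero = (0.01, 0.05)   # 4 percentage-point difference near 0%
mid_range = (0.40, 0.60)   # 20 percentage-point difference mid-range

near_zero_shift = m_value(near_zero[1]) - m_value(near_zero[0])
mid_shift = m_value(mid_range[1]) - m_value(mid_range[0])

print(f"1% -> 5%:   M-value shift = {near_zero_shift:.2f}")  # about 2.38
print(f"40% -> 60%: M-value shift = {mid_shift:.2f}")        # about 1.17
```

On the transformed scale, the 4-point change near 0% is about twice as large as the 20-point change in the middle of the range, which is one reason it is worth plotting the raw percent methylation next to any transformed statistic.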