Differential methylation analysis of 450k methylation data: which is appropriate, logistic or linear regression?
7.7 years ago

Hi,

I have about 75 methylation profiles from diseased subjects that differ in several variables, e.g. a continuous variable that indicates disease severity. There are no classical different groups. The data stem from Illumina 450k arrays. I have done QC, normalization etc. with minfi, and ended up with a matrix of beta values.

I looked into different ways to assess differential methylation related to these variables. I am, however, unsure what the most appropriate way to tackle this problem would be.

I am concerned about the distribution of the beta values, which, as far as I remember, makes this problem unsuitable for ordinary linear methods, so I could not use limma or simply fit a large number of linear models. Or am I mistaken?

beta ~ variable1 + variable2 + (1|subject) (450k times, possibly inappropriate?)

An alternative would be to dichotomize the data (1/0), e.g. by calling every beta below 0.5 'unmethylated' and every beta above it 'methylated'. I could then use (a lot of) logistic regressions to check for variables related to methylation. Still, this approach would lose quite a lot of detail.

Methylated(0,1) ~ variable1 + variable2 + (1|subject) (450k times, takes a long time)
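
To make the loss of detail concrete, here is a minimal sketch of the 0.5 cutoff described above. The function name and the toy beta values are hypothetical, and treating a beta of exactly 0.5 as 'unmethylated' is an assumption, since the description leaves that boundary open.

```python
def dichotomize(beta, cutoff=0.5):
    """Return 1 ('methylated') if beta is above the cutoff, else 0 ('unmethylated').

    Assumption: a beta exactly equal to the cutoff is called unmethylated.
    """
    return 1 if beta > cutoff else 0


# Hypothetical beta values for one probe across four samples.
betas = [0.05, 0.49, 0.51, 0.95]
calls = [dichotomize(b) for b in betas]
print(calls)
```

After the cutoff, 0.05 and 0.49 become indistinguishable, as do 0.51 and 0.95, while the small step from 0.49 to 0.51 flips the call.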

What do you think?

Many thanks!

R methylation • 2.3k views
7.7 years ago

I would start by restricting my focus to a subset of loci whose beta values show the highest between-sample variance, and make some heatmaps with your variables as annotations. Changing how many loci you filter out and which clustering method/distance you use will help you see how robust the associations you find really are. Actually, I would probably try to run a PCA before any of that.

Also, I've always subset my beta values as:

  • 0-30 = 'unmethylated'
  • 31-70 = 'low methylation'
  • 71-100 = 'high methylation'

...but that's just what I've been told people do; sorry, I don't have a reference to point you to.
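
A minimal sketch of that three-way binning, assuming the bins above are percentages and that the beta values are on the usual 0-1 scale (so the cutoffs become 0.30 and 0.70). The function name is hypothetical, and assigning values that fall exactly on a cutoff to the lower bin is an assumption.

```python
def methylation_level(beta, low_cut=0.30, high_cut=0.70):
    """Map a beta value in [0, 1] to one of three methylation levels.

    Assumption: a value exactly on a cutoff goes to the lower bin.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError(f"beta must lie in [0, 1], got {beta}")
    if beta <= low_cut:
        return "unmethylated"
    if beta <= high_cut:
        return "low methylation"
    return "high methylation"


# Hypothetical beta values, one per bin.
for beta in (0.12, 0.45, 0.88):
    print(beta, methylation_level(beta))
```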

Are you using methylKit or some other packages?


Many thanks for your input, probably very smart to limit the number of probes. Will definitely do so. Also helps a lot with the Bonferroni correction. So you would go for a logistic regression, but with three levels? I am using minfi at the moment to read in the data and get betas, and take it from there with handmade code.


Yes, a three-level logistic regression on a subset of highly variable probes seems sensible, although I've never done it as formally as all that. My approach has generally been to:

  1. Sort loci by decreasing between-sample variance.
  2. Select the most variable 5, 10, or 25% of loci.
  3. Plot a heatmap with hierarchical clustering applied to both the loci and the samples.
  4. Compare several heatmaps and take note of which associations between clusters of samples and clinical features persist across multiple resolutions.
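
The first two steps (sorting by variance and keeping the top fraction) can be sketched as follows. This is only an illustration in plain Python: the probe IDs and beta values are made up, the function name is hypothetical, and a real 450k matrix would normally be handled with matrix operations rather than a dictionary.

```python
from statistics import variance


def most_variable_probes(betas, fraction):
    """Return the probe IDs with the highest between-sample variance.

    betas: dict mapping probe ID -> list of beta values (one per sample).
    fraction: share of probes to keep, e.g. 0.05, 0.10 or 0.25.
    """
    # Step 1: sort probes by decreasing sample variance.
    ranked = sorted(betas, key=lambda probe: variance(betas[probe]), reverse=True)
    # Step 2: keep the top fraction, but always at least one probe.
    n_keep = max(1, int(len(ranked) * fraction))
    return ranked[:n_keep]


# Hypothetical toy data: four probes measured in four samples.
toy = {
    "probe_a": [0.10, 0.90, 0.20, 0.80],
    "probe_b": [0.50, 0.52, 0.49, 0.51],
    "probe_c": [0.30, 0.60, 0.35, 0.65],
    "probe_d": [0.95, 0.96, 0.94, 0.95],
}

for fraction in (0.25, 0.50):
    print(fraction, most_variable_probes(toy, fraction))
```

Running the selection at several fractions gives the probe sets to feed into the heatmaps, so the clusterings can be compared across resolutions.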

In the case of continuous clinical variables, it can be tricky because the samples sometimes cluster quite differently with slight changes in the number of probes you consider. My experience has been that if you take enough subsets of the most variable loci you'll soon see consistent groupings of at least most of your samples - but choosing to focus on a particular subset of probes (e.g. the 17% most variable loci) because they associate nicely with your clinical variable(s) is obviously problematic.

Another thing you may want to consider is excluding loci that show high methylation in your normal samples - not sure how appropriate that would be for your research questions, but I have mostly used methylation data to look at tumors from a population showing unusual etiology, and it has made finding associations easier.
