Hi Sven,

Seems like a very interesting project. I would do the following:

1. prune your SNP dataset based on linkage disequilibrium (LD) so that you are only looking at the most informative SNPs and also to reduce your variable load (OPTIONAL)
2. for each methylation region, take SNPs within a defined window surrounding the region and test each independently
3. for each methylation region, take the statistically significant SNPs and put those in the final model
4. reduce the final model further through stepwise regression (OPTIONAL)
5. test the final reduced model's robustness via R-squared shrinkage, ROC analysis, and cross-validation
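As a rough illustration of the windowing idea (taking SNPs within a defined window around a methylation region), here is a minimal base R sketch. The `snps` table, the `region` coordinates, and the 50 kb `window` are all made-up values for the example, not recommendations:

```r
# Hypothetical toy annotation: SNP IDs with chromosome and base-pair position
snps <- data.frame(
  id  = paste0("SNP", 1:6),
  chr = "chr1",
  pos = c(1000, 48000, 52000, 99000, 160000, 400000)
)

# Hypothetical methylation region and window size (bp either side of the region)
region <- list(chr = "chr1", start = 50000, end = 51000)
window <- 50000  # assumed value; choose this based on your own data

# Keep SNPs on the same chromosome that fall inside the extended region
in_window <- snps$chr == region$chr &
  snps$pos >= region$start - window &
  snps$pos <= region$end + window

candidates <- snps$id[in_window]
print(candidates)
```

The `candidates` vector is then the set of SNPs you would test one by one against that region.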


In part 2, when I say 'test each independently', I mean:

```
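# `meth%` is not a syntactic R name, so it needs backticks;
# 'dat' is an assumed data frame holding the outcome and the SNP columns
glm(`meth%` ~ SNP1, data = dat)
glm(`meth%` ~ SNP2, data = dat)
glm(`meth%` ~ SNP3, data = dat)
# ...and so on, one model per SNP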
```
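With more than a handful of SNPs you would normally loop rather than type each model out. The following is only a sketch on simulated data: `dat`, `meth_pct` (a syntactic stand-in for the `meth%` outcome), the simulated effect of `SNP2`, and the choice of Benjamini-Hochberg adjustment at 0.05 are all my assumptions for the example:

```r
set.seed(1)
n <- 200

# Simulated genotypes coded as minor-allele counts (0/1/2); illustrative only
dat <- as.data.frame(replicate(5, rbinom(n, 2, 0.3)))
names(dat) <- paste0("SNP", 1:5)

# Simulated outcome in which only SNP2 has a real effect
dat$meth_pct <- 50 + 8 * dat$SNP2 + rnorm(n, sd = 5)

snp_names <- paste0("SNP", 1:5)

# Fit one model per SNP and pull out that SNP's p-value
pvals <- sapply(snp_names, function(snp) {
  fit <- glm(reformulate(snp, response = "meth_pct"), data = dat)
  coef(summary(fit))[snp, "Pr(>|t|)"]
})

# Adjust for the number of tests, then keep the SNPs passing the threshold
padj <- p.adjust(pvals, method = "BH")
significant <- names(padj)[padj < 0.05]
print(significant)
```

The `significant` vector is what would feed into the final model for that region.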

In part 3, if `SNP2`, `SNP3`, `SNP8`, and `SNP9` were your statistically significant SNPs, then the final model would be:

```
# backticks because `meth%` is not a syntactic name; 'dat' is an assumed data frame
final <- glm(`meth%` ~ SNP2 + SNP3 + SNP8 + SNP9, data = dat)
```
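To show how the final model and the optional stepwise reduction could fit together, here is a small sketch on simulated data. The data frame `dat`, the outcome name `meth_pct` (standing in for `meth%`), and the simulated effects are invented for the example; `step()` from base R does AIC-based stepwise selection, which is one of several ways to do this:

```r
set.seed(2)
n <- 200

# Simulated genotypes (minor-allele counts) for the four 'significant' SNPs
dat <- as.data.frame(replicate(4, rbinom(n, 2, 0.3)))
names(dat) <- c("SNP2", "SNP3", "SNP8", "SNP9")

# Simulated outcome: only SNP2 and SNP8 truly affect methylation here
dat$meth_pct <- 50 + 6 * dat$SNP2 - 4 * dat$SNP8 + rnorm(n, sd = 5)

# Build the final model formula from the vector of significant SNPs
significant <- c("SNP2", "SNP3", "SNP8", "SNP9")
f <- reformulate(significant, response = "meth_pct")
final <- glm(f, data = dat)

# Optional: AIC-based stepwise reduction of the final model
reduced <- step(final, direction = "both", trace = 0)
print(formula(reduced))
```

Whatever terms survive in `reduced` are the ones you would then carry forward to cross-validation and the other robustness checks.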

Regarding your SNP encoding, you can have these as:

- continuous variables (counts of minor alleles)
- categorical variables (HomMinor, HomMajor, Het)
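The two encodings lead to different models, which a quick sketch on simulated data may make clearer. All variable names and the simulated effect size are assumptions for illustration:

```r
set.seed(3)
n <- 150

# Continuous encoding: number of minor alleles (0, 1 or 2)
minor_count <- rbinom(n, 2, 0.4)
meth_pct <- 40 + 5 * minor_count + rnorm(n, sd = 4)  # simulated outcome

# Categorical encoding of the same genotypes, with HomMajor as the reference level
genotype <- factor(minor_count, levels = c(0, 1, 2),
                   labels = c("HomMajor", "Het", "HomMinor"))

# Additive model: one slope per extra copy of the minor allele
fit_additive <- glm(meth_pct ~ minor_count)

# Genotypic model: separate effects for Het and HomMinor versus HomMajor
fit_genotypic <- glm(meth_pct ~ genotype)

print(coef(fit_additive))
print(coef(fit_genotypic))
```

The continuous encoding assumes each extra minor allele shifts methylation by the same amount; the categorical encoding drops that assumption at the cost of one more parameter per SNP.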

Regarding your outcome, you can equally encode this as continuous or categorical.
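In practice, the outcome encoding decides which `family` you pass to `glm()`. Here is a minimal sketch on simulated data; the variable names and the median split used to create the categorical outcome are arbitrary assumptions for the example:

```r
set.seed(4)
n <- 200
snp <- rbinom(n, 2, 0.3)  # simulated minor-allele counts

# Simulated methylation percentage, clipped to the 0-100 range
meth_pct <- pmin(pmax(30 + 10 * snp + rnorm(n, sd = 8), 0), 100)

# Continuous outcome: ordinary linear model (gaussian family)
fit_cont <- glm(meth_pct ~ snp, family = gaussian)

# Categorical outcome: high vs low methylation (median cut-off is arbitrary here),
# modelled with logistic regression (binomial family)
high_meth <- as.integer(meth_pct > median(meth_pct))
fit_bin <- glm(high_meth ~ snp, family = binomial)

print(coef(fit_cont))
print(exp(coef(fit_bin)["snp"]))  # odds ratio per extra minor allele
```

A binary outcome like `high_meth` is also what makes ROC analysis applicable later on.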

Instead of glm, you could also do lasso-penalised regression. You can also build multiple models in various ways and then compare them, as I do here:

I go over more on these things here:

There's a lot of other material on Biostars and elsewhere, too.

Kevin

Is there any particular reason or hypothesis suggesting that SNPs' effects on DNA methylation are local? I would guess methylation status in a region could well be influenced by variants very far away?

That is true, Vitis.