Hi all,

I need some advice. I am doing an analysis to look for an association between methylation and stress. I use lmfit model. My stress variable is continuous and I am not comparing any stress or control groups.

An example of my code:- design <- model.matrix(~ stress + age + gender + smoking + Plate , pheno)

sv <- sva(getM(eset.nosex), design, vfilter = 5e4) ##surrogate varaibles design <- cbind(design, sv$sv) fit <- lmFit(getM(eset.nosex), design) fit <- eBayes(fit)

names(fit)

# [1] "coefficients" "rank" "assign" "qr"

# [5] "df.residual" "sigma" "cov.coefficients" "stdev.unscaled"

# [9] "pivot" "Amean" "method" "design"

# [13] "df.prior" "s2.prior" "var.prior" "proportion"

# [17] "s2.post" "t" "df.total" "p.value"

# [21] "lods" "F" "F.p.value"

topTable(fit, 2)

# logFC AveExpr t P.Value adj.P.Val B

# cg22700843 -0.003161108 4.139965 -6.653562 1.448413e-09 0.0004906854 8.411029

# cg13026838 -0.004696756 2.379085 -6.652539 1.455450e-09 0.0004906854 8.406237

# cg01495737 -0.002898319 3.515948 -6.605255 1.820603e-09 0.0004906854 8.184942

# cg18390621 -0.006030697 5.923849 -6.292316 7.889849e-09 0.0015948422 6.736955

# cg26562879 -0.002665734 3.549174 -6.132382 1.651324e-08 0.0026703692 6.008840

# cg00198066 -0.003245212 3.618281 -5.942723 3.923888e-08 0.0052877921 5.156740

# cg24874350 0.007544302 -2.733933 5.898567 4.791739e-08 0.0055348287 4.960210

# cg20905655 0.004923986 2.635893 5.812363 7.064432e-08 0.0071399681 4.578617

How would one interpret log fold change in terms of methylation (especially when there is no group comparison and only an association study)? Arent estimates(coefficients) a better measurement for methylation in association studies?

thankyou!

Thank you for your reply.

Do you mean the coefficients or the estimates?

Yes, I did check fit$coefficients. It gives coefficients for each variable in the design matrix.

head(fit$coefficients)

cg14817997 1.3965045 0.02149242 0.052808421 0.02321043 cg26928153 2.3554372 -0.02522342 -0.036111524 0.04056885 cg16269199 0.0455470 0.02982565 -0.114791997 0.06911121 cg13869341 0.2278520 0.05470675 -0.187123556 0.07042281 cg14008030 0.2342564 0.07319959 0.001113212 0.02870625 cg12045430 -3.7704926 -0.06934915 0.168677371 -0.03177391 Smoking Plate2 cg14817997 -0.16372064 0.6785156 cg26928153 -0.04224105 -0.5071137 cg16269199 -0.05919517 -0.8600021 cg13869341 -0.32086356 0.6786138 cg14008030 -0.23856418 2.3189791 cg12045430 0.23110099 -1.7975213

But what I would like to know is just for the variable of interest:- stress

topTable(fit2, 2)

cg13547817 -0.3738731 3.5772747 -5.624865 1.955557e-07 0.1581173 5.915087 cg16692227 -0.3559524 0.9775661 -5.455437 4.043169e-07 0.1634560 5.334972 cg03287633 -0.1867909 3.0880473 -5.290847 8.107974e-07 0.1738228 4.778687 cg17474383 -0.2511642 -0.0911020 -5.226623 1.060811e-06 0.1738228 4.563692 cg21858516 -0.2169699 -0.7935851 -5.223461 1.074899e-06 0.1738228 4.553137 cg16294325 -0.2948937 3.3952271 -5.110209 1.719679e-06 0.2317423 4.177111

Usually, when you run a normal lm model (without lmfit) we get estimates, std. error, t value, and pvalue. Which is pretty easy to interpret.

Yes, true beta values are better for biological interpretation, but after running the model in lmfit (using m values) I can convert to beta values and that should work right?

meth_beta = topTable(fit2, 2 , sort.by="p", number = nrow(getM(eset.nosex)))

thanks again.

Yes, I meant that, if you fit a continuous covariate (e.g. stress), the

`logFC`

column actually gives the same (estimates/coefficients) as those estimates/coefficients contained in`fit$coefficients`

. To check, reorder the 2 objects (the`$logFC`

column from`topTable`

and the`fit$coefficients`

) to have the ordered CpG sites in the same objects and see that they are the same numbers: i.e. the`$logFC`

, if you extract the second coefficient, should be the same as the`Stress`

column in`fit`

)(I just did a

`lmFit`

in my own data for a continuous covariate (age) and checked that the`logFC`

contained the same as the age coefficient in`fit$coefficients`

).Unless we're using different terminology, I was referring to the coefficients/estimates as the same thing (they represent the mean of change in Y for unit of change in X, thus being a measure of effect size; if you had factors, they would be the means or changes in means between groups).

Regarding going "back" from M-values to b-values, I don't think you can convert coefficients in M-value scale to coefficients in b-value scale, because the M- and b-value relationship is non-linear. For example, a change from 0 to 2 in M-value can be a 30% change in beta, while from 2 to 4, the beta change is much less (see Fig. 1 here).

I would rather directly use the original beta values. Maybe you could also estimate the coefficients using beta values (with

`lmFit`

) and use those for ranking or filtering CpGs by how much they change in beta value, if you want a more biological interpretation, while using the p-values from the tests done with the M-values.Thank you!

Yes, I did check the $logfc measurements and fir$coefficients and they are the same. That helps a lot.

yes, we cannot change the m-value coefficients to coefficients on the b-value scale. I did read the paper and they too advise using m-values for the analysis and b-values for the biological interpretation. I do use the beta values for plotting. If we need b-value-based coefficients then that would mean running the analysis with b-values and that doesn't make sense. It like a loop.

But thanks a ton, the logfc was a great help.

Yes, I would also do everything on M-values. The reason I suggested also fitting the analyses on b-values is because:

1) with lmFit you have a very fast way of obtaining stress coefficients in b-value scale, which you could us as (biologically interpretable) measures of change

2) because you use other covariates in the models (e.g. gender, age, smoking...), these stress coefficients will be adjusted for these variables and will reflect better what you are really testing in your models. This is because when you do the models you are not actually looking at changes in "raw" b-values, or M-values, in a sense. In these models the effects (the coefficients) of your variable of interest are "adjusted" for the other covariates. If you only looked at the "raw" b-values it would not be exactly the same that you were actually testing within limma.

That makes very good sense, and very understandable. I would prefer to use m-values too. I also find it easier to work with limma. I just feel it needs to be more 'methylation-friendly' (as in more guides and more examples). Limma is more expression array-based.

Thank you so much !