Question

robust linear regression with interaction term

1

Entering edit mode

7 weeks ago

Bioinfonext ▴ 460

Hi,

I am looking how epigenetic age associated with toxic element exposure in four different road buffer zone (1000m, 2000m, 3000m, &4000m) using robust linear regression. I can run robust linear regression without interaction term, but could you please help me how to run robust linear regression with interaction term (Hg*buffer zone). I have tried below to run robust regression with interaction term but I am not getting 1000m buffer zone interaction with Hg and also I am not getting Pvalue.

summary(rr.huber <- rlm(PCGrimAge_IEAA ~ 0+Hg*buffer.zone + Sex +Age+ Smoking_Status, data = ph1))

Call: rlm(formula = PCGrimAge_IEAA ~ 0 + Hg * buffer.zone + 
    Sex + Age + Smoking_Status, data = ph1)
Residuals:
     Min       1Q   Median       3Q      Max 
-7.68190 -1.59376 -0.01048  1.58263 16.07211 

Coefficients:
                               Value    Std. Error t value 
           Hg                   -0.2404   0.3345    -0.7187
buffer.zone1000m                -0.4037   0.5335    -0.7567
buffer.zone2000m                 0.0406   0.7010     0.0579
buffer.zone3000m                 -0.7888   0.4695    -1.6799
buffer.zone4000m                 -1.0178   0.5868    -1.7344
Sex                             -1.9878   0.1239   -16.0398
Age                              0.0011   0.0066     0.1720
Smoking_Status                   2.7252   0.0922    29.5725
Hg:buffer.zone2000m             -0.6463   0.8488    -0.7614
Hg:buffer.zone3000m              0.2163   0.3691     0.5860
Hg:buffer.zone4000m              0.5614   0.5351     1.0491

Residual standard error: 2.352 on 1801 degrees of freedom
  (1 observation deleted due to missingness)

R epigenetic biostatistics statistics • 291 views

ADD COMMENT • link updated 7 weeks ago by Jeremy ▴ 910 • written 7 weeks ago by Bioinfonext ▴ 460

score 0 · Answer 1 · 2024-03-06

For p-values, see a similar question on Stack Exchange here:

p-values in rlm()

For the "missing" interaction term, here is an answer from ChatGPT:

When you use a categorical variable in a regression model in R (such as through the rlm() function from the MASS package, which fits robust linear models), R automatically converts it into a set of dummy variables for the levels of the factor, excluding one level to serve as a reference (or baseline). This is known as "dummy coding" or "one-hot encoding." The level that is excluded is typically the first level (alphabetically or numerically, depending on the factor's values), unless otherwise specified.

In your case, buffer.zone is a categorical feature with 4 classes: 1000m, 2000m, 3000m, 4000m. When you include it in the model along with an interaction term, R converts buffer.zone into dummy variables, with one of the buffer zones (usually the first one, buffer.zone1000m) being left out as the reference category. Therefore, in your model, the effects of buffer.zone2000m, buffer.zone3000m, and buffer.zone4000m are measured relative to buffer.zone1000m.

The reason you don't see Hg:buffer.zone1000m in your summary output is that buffer.zone1000m is the reference level against which the other buffer zones are compared. Hence, the interaction terms you see (Hg:buffer.zone2000m, Hg:buffer.zone3000m, and Hg:buffer.zone4000m) represent the interaction effects of Hg with each of those buffer zones relative to the reference buffer zone (buffer.zone1000m).

If you want to explicitly include interactions with buffer.zone1000m or use a different buffer zone as the reference, you can relevel your factor buffer.zone before fitting the model.